DESIGN AND ANALYSIS OF DISTRIBUTED ALGORITHMS
Nicola Santoro Carleton University, Ottawa, Canada

WILEY-INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Santoro, N. (Nicola), 1951-
Design and analysis of distributed algorithms / by Nicola Santoro.
p. cm. – (Wiley series on parallel and distributed computing)
Includes index.
ISBN-13: 978-0-471-71997-7 (cloth)
ISBN-10: 0-471-71997-8 (cloth)
1. Electronic data processing–Distributed processing. 2. Computer algorithms. I. Title. II. Series.
QA76.9.D5.S26 2007
005.1–dc22
2006011214

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To my favorite distributed environment: My children Monica, Noel, Melissa, Maya, Michela, Alvin.

CONTENTS

Preface

1. Distributed Computing Environments
  1.1 Entities
  1.2 Communication
  1.3 Axioms and Restrictions
    1.3.1 Axioms
    1.3.2 Restrictions
  1.4 Cost and Complexity
    1.4.1 Amount of Communication Activities
    1.4.2 Time
  1.5 An Example: Broadcasting
  1.6 States and Events
    1.6.1 Time and Events
    1.6.2 States and Configurations
  1.7 Problems and Solutions ()
  1.8 Knowledge
    1.8.1 Levels of Knowledge
    1.8.2 Types of Knowledge
  1.9 Technical Considerations
    1.9.1 Messages
    1.9.2 Protocol
    1.9.3 Communication Mechanism
  1.10 Summary of Definitions
  1.11 Bibliographical Notes
  1.12 Exercises, Problems, and Answers
    1.12.1 Exercises and Problems
    1.12.2 Answers to Exercises

2. Basic Problems And Protocols
  2.1 Broadcast
    2.1.1 The Problem
    2.1.2 Cost of Broadcasting
    2.1.3 Broadcasting in Special Networks
  2.2 Wake-Up
    2.2.1 Generic Wake-Up
    2.2.2 Wake-Up in Special Networks
  2.3 Traversal
    2.3.1 Depth-First Traversal
    2.3.2 Hacking ()
    2.3.3 Traversal in Special Networks
    2.3.4 Considerations on Traversal
  2.4 Practical Implications: Use a Subnet
  2.5 Constructing a Spanning Tree
    2.5.1 SPT Construction with a Single Initiator: Shout
    2.5.2 Other SPT Constructions with Single Initiator
    2.5.3 Considerations on the Constructed Tree
    2.5.4 Application: Better Traversal
    2.5.5 Spanning-Tree Construction with Multiple Initiators
    2.5.6 Impossibility Result
    2.5.7 SPT with Initial Distinct Values
  2.6 Computations in Trees
    2.6.1 Saturation: A Basic Technique
    2.6.2 Minimum Finding
    2.6.3 Distributed Function Evaluation
    2.6.4 Finding Eccentricities
    2.6.5 Center Finding
    2.6.6 Other Computations
    2.6.7 Computing in Rooted Trees
  2.7 Summary
    2.7.1 Summary of Problems
    2.7.2 Summary of Techniques
  2.8 Bibliographical Notes
  2.9 Exercises, Problems, and Answers
    2.9.1 Exercises
    2.9.2 Problems
    2.9.3 Answers to Exercises

3. Election
  3.1 Introduction
    3.1.1 Impossibility Result
    3.1.2 Additional Restrictions
    3.1.3 Solution Strategies
  3.2 Election in Trees
  3.3 Election in Rings
    3.3.1 All the Way
    3.3.2 As Far As It Can
    3.3.3 Controlled Distance
    3.3.4 Electoral Stages
    3.3.5 Stages with Feedback
    3.3.6 Alternating Steps
    3.3.7 Unidirectional Protocols
    3.3.8 Limits to Improvements ()
    3.3.9 Summary and Lessons
  3.4 Election in Mesh Networks
    3.4.1 Meshes
    3.4.2 Tori
  3.5 Election in Cube Networks
    3.5.1 Oriented Hypercubes
    3.5.2 Unoriented Hypercubes
  3.6 Election in Complete Networks
    3.6.1 Stages and Territory
    3.6.2 Surprising Limitation
    3.6.3 Harvesting the Communication Power
  3.7 Election in Chordal Rings ()
    3.7.1 Chordal Rings
    3.7.2 Lower Bounds
  3.8 Universal Election Protocols
    3.8.1 Mega-Merger
    3.8.2 Analysis of Mega-Merger
    3.8.3 YO-YO
    3.8.4 Lower Bounds and Equivalences
  3.9 Bibliographical Notes
  3.10 Exercises, Problems, and Answers
    3.10.1 Exercises
    3.10.2 Problems
    3.10.3 Answers to Exercises

4. Message Routing and Shortest Paths
  4.1 Introduction
  4.2 Shortest Path Routing
    4.2.1 Gossiping the Network Maps
    4.2.2 Iterative Construction of Routing Tables
    4.2.3 Constructing Shortest-Path Spanning Tree
    4.2.4 Constructing All-Pairs Shortest Paths
    4.2.5 Min-Hop Routing
    4.2.6 Suboptimal Solutions: Routing Trees
  4.3 Coping with Changes
    4.3.1 Adaptive Routing
    4.3.2 Fault-Tolerant Tables
    4.3.3 On Correctness and Guarantees
  4.4 Routing in Static Systems: Compact Tables
    4.4.1 The Size of Routing Tables
    4.4.2 Interval Routing
  4.5 Bibliographical Notes
  4.6 Exercises, Problems, and Answers
    4.6.1 Exercises
    4.6.2 Problems
    4.6.3 Answers to Exercises

5. Distributed Set Operations
  5.1 Introduction
  5.2 Distributed Selection
    5.2.1 Order Statistics
    5.2.2 Selection in a Small Data Set
    5.2.3 Simple Case: Selection Among Two Sites
    5.2.4 General Selection Strategy: RankSelect
    5.2.5 Reducing the Worst Case: ReduceSelect
  5.3 Sorting a Distributed Set
    5.3.1 Distributed Sorting
    5.3.2 Special Case: Sorting on a Ordered Line
    5.3.3 Removing the Topological Constraints: Complete Graph
    5.3.4 Basic Limitations
    5.3.5 Efficient Sorting: SelectSort
    5.3.6 Unrestricted Sorting
  5.4 Distributed Sets Operations
    5.4.1 Operations on Distributed Sets
    5.4.2 Local Structure
    5.4.3 Local Evaluation ()
    5.4.4 Global Evaluation
    5.4.5 Operational Costs
  5.5 Bibliographical Notes
  5.6 Exercises, Problems, and Answers
    5.6.1 Exercises
    5.6.2 Problems
    5.6.3 Answers to Exercises

6. Synchronous Computations
  6.1 Synchronous Distributed Computing
    6.1.1 Fully Synchronous Systems
    6.1.2 Clocks and Unit of Time
    6.1.3 Communication Delays and Size of Messages
    6.1.4 On the Unique Nature of Synchronous Computations
    6.1.5 The Cost of Synchronous Protocols
  6.2 Communicators, Pipeline, and Transformers
    6.2.1 Two-Party Communication
    6.2.2 Pipeline
    6.2.3 Transformers
  6.3 Min-Finding and Election: Waiting and Guessing
    6.3.1 Waiting
    6.3.2 Guessing
    6.3.3 Double Wait: Integrating Waiting and Guessing
  6.4 Synchronization Problems: Reset, Unison, and Firing Squad
    6.4.1 Reset / Wake-up
    6.4.2 Unison
    6.4.3 Firing Squad
  6.5 Bibliographical Notes
  6.6 Exercises, Problems, and Answers
    6.6.1 Exercises
    6.6.2 Problems
    6.6.3 Answers to Exercises

7. Computing in Presence of Faults
  7.1 Introduction
    7.1.1 Faults and Failures
    7.1.2 Modelling Faults
    7.1.3 Topological Factors
    7.1.4 Fault Tolerance, Agreement, and Common Knowledge
  7.2 The Crushing Impact of Failures
    7.2.1 Node Failures: Single-Fault Disaster
    7.2.2 Consequences of the Single Fault Disaster
  7.3 Localized Entity Failures: Using Synchrony
    7.3.1 Synchronous Consensus with Crash Failures
    7.3.2 Synchronous Consensus with Byzantine Failures
    7.3.3 Limit to Number of Byzantine Entities for Agreement
    7.3.4 From Boolean to General Byzantine Agreement
    7.3.5 Byzantine Agreement in Arbitrary Graphs
  7.4 Localized Entity Failures: Using Randomization
    7.4.1 Random Actions and Coin Flips
    7.4.2 Randomized Asynchronous Consensus: Crash Failures
    7.4.3 Concluding Remarks
  7.5 Localized Entity Failures: Using Fault Detection
    7.5.1 Failure Detectors and Their Properties
    7.5.2 The Weakest Failure Detector
  7.6 Localized Entity Failures: Pre-Execution Failures
    7.6.1 Partial Reliability
    7.6.2 Example: Election in Complete Network
  7.7 Localized Link Failures
    7.7.1 A Tale of Two Synchronous Generals
    7.7.2 Computing With Faulty Links
    7.7.3 Concluding Remarks
    7.7.4 Considerations on Localized Entity Failures
  7.8 Ubiquitous Faults
    7.8.1 Communication Faults and Agreement
    7.8.2 Limits to Number of Ubiquitous Faults for Majority
    7.8.3 Unanimity in Spite of Ubiquitous Faults
    7.8.4 Tightness
  7.9 Bibliographical Notes
  7.10 Exercises, Problems, and Answers
    7.10.1 Exercises
    7.10.2 Problems
    7.10.3 Answers to Exercises

8. Detecting Stable Properties
  8.1 Introduction
  8.2 Deadlock Detection
    8.2.1 Deadlock
    8.2.2 Detecting Deadlock: Wait-for Graph
    8.2.3 Single-Request Systems
    8.2.4 Multiple-Requests Systems
    8.2.5 Dynamic Wait-for Graphs
    8.2.6 Other Requests Systems
  8.3 Global Termination Detection
    8.3.1 A Simple Solution: Repeated Termination Queries
    8.3.2 Improved Protocols: Shrink
    8.3.3 Concluding Remarks
  8.4 Global Stable Property Detection
    8.4.1 General Strategy
    8.4.2 Time Cuts and Consistent Snapshots
    8.4.3 Computing A Consistent Snapshot
    8.4.4 Summary: Putting All Together
  8.5 Bibliographical Notes
  8.6 Exercises, Problems, and Answers
    8.6.1 Exercises
    8.6.2 Problems
    8.6.3 Answers to Exercises

9. Continuous Computations
  9.1 Introduction
  9.2 Keeping Virtual Time
    9.2.1 Virtual Time and Causal Order
    9.2.2 Causal Order: Counter Clocks
    9.2.3 Complete Causal Order: Vector Clocks
    9.2.4 Concluding Remarks
  9.3 Distributed Mutual Exclusion
    9.3.1 The Problem
    9.3.2 A Simple And Efficient Solution
    9.3.3 Traversing the Network
    9.3.4 Managing a Distributed Queue
    9.3.5 Decentralized Permissions
    9.3.6 Mutual Exclusion in Complete Graphs: Quorum
    9.3.7 Concluding Remarks
  9.4 Deadlock: System Detection and Resolution
    9.4.1 System Detection and Resolution
    9.4.2 Detection and Resolution in Single-Request Systems
    9.4.3 Detection and Resolution in Multiple-Requests Systems
  9.5 Bibliographical Notes
  9.6 Exercises, Problems, and Answers
    9.6.1 Exercises
    9.6.2 Problems
    9.6.3 Answers to Exercises

Index

PREFACE

The computational universe surrounding us is clearly quite different from that envisioned by the designers of the large mainframes of half a century ago. Even the subsequent most futuristic visions of supercomputing and of parallel machines, which have guided the research drive and absorbed the research funding for so many years, are far from today’s computational realities. These realities are characterized by the presence of communities of networked entities communicating with each other, cooperating toward common tasks or the solution of a shared problem, and acting autonomously and spontaneously. They are distributed computing environments.

It has been from the fields of network and of communication engineering that the seeds of what we now experience have germinated. The growth in understanding has occurred when computer scientists (initially very few) started to become aware of and study the computational issues connected with these new network-centric realities. The internet, the web, and the grids are just examples of these environments. Whether over wired or wireless media, whether by static or nomadic code, computing in such environments is inherently decentralized and distributed.

To compute in distributed environments one must understand the basic principles, the fundamental properties, the available tools, and the inherent limitations. This book focuses on the algorithmics of distributed computing; that is, on how to solve problems and perform tasks efficiently in a distributed computing environment. Because of the multiplicity and variety of distributed systems and networked environments and their widespread differences, this book does not focus on any single one of them.
Rather, it describes and employs a distributed computing universe that captures the nature and basic structure of those systems (e.g., distributed operating systems, data communication networks, distributed databases, transaction processing systems, etc.), allowing us to discard or ignore the system-specific details while identifying the general principles and techniques. This universe consists of a finite collection of computational entities communicating by means of messages in order to achieve a common goal; for example, to perform a given task, to compute the solution to a problem, to satisfy a request either from the user (i.e., outside the environment) or from other entities. Although each entity is capable of performing computations, it is the collection of all these entities that together will solve the problem or ensure that the task is performed. In this universe, to solve a problem, we must discover and design a distributed algorithm or protocol for those entities: a set of rules that specify what each entity has to do. The collective but autonomous execution of those rules, possibly without any supervision or synchronization, must enable the entities to perform the desired task to solve the problem. In the design process, we must ensure both correctness (i.e., the protocol we design indeed solves the problem) and efficiency (i.e., the protocol we design has a “small” cost).

As the title says, this book is on the Design and Analysis of Distributed Algorithms. Its goal is to enable the reader to learn how to design protocols to solve problems in a distributed computing environment, not by listing the results but rather by teaching how they can be obtained. In addition to the “how” and “why” (necessary for problem solution, from basic building blocks to complex protocol design), it focuses on providing the analytical tools and skills necessary for complexity evaluation of designs.

There are several levels of use of the book. The book is primarily a senior-undergraduate and graduate textbook; it contains the material for two one-term courses or alternatively a full-year course on Distributed Algorithms and Protocols, Distributed Computing, Network Computing, or Special Topics in Algorithms. It covers the “distributed part” of a graduate course on Parallel and Distributed Computing (the chapters on Distributed Data, Routing, and Synchronous Computing, in particular), and it is the theoretical companion book for a course in Distributed Systems, Advanced Operating Systems, or Distributed Data Processing.

Footnote 1: Incredibly, the terms “distributed systems” and “distributed computing” have been for years hijacked and (ab)used to describe very limited systems and low-level solutions (e.g., client-server) that have little to do with distributed computing.
The book is written for the students from the students’ point of view, and it follows closely a well-defined teaching path and method (the “course”) developed over the years; both the path and the method become apparent while reading and using the book. It also provides a self-contained, self-directed guide for system-protocol designers and for communication software engineers and developers, as well as for researchers wanting to enter or just interested in the area; it enables hands-on, head-on, and in-depth acquisition of the material. In addition, it is a serious sourcebook and reference book for investigators in distributed computing and related areas.

Unlike the other available textbooks on these subjects, the book is based on a very simple fully reactive computational model. From a learning point of view, this makes the explanations clearer and readers’ comprehension easier. From a teaching point of view, this approach provides the instructor with a natural way to present otherwise difficult material and to guide the students through, step by step. The instructors themselves, if not already familiar with the material or with the approach, can achieve proficiency quickly and easily. All protocols in the textbook as well as those designed by the students as part of the exercises are immediately programmable. Hence, the subtleties of actual implementation can be employed to enhance the understanding of the theoretical design principles; furthermore, experimental analysis (e.g., performance evaluation and comparison) can be easily and usefully integrated in the coursework, expanding the analytical tools.

The book is written so as to require no prerequisites other than standard undergraduate knowledge of operating systems and of algorithms. Clearly, concurrent or prior knowledge of communication networks, distributed operating systems, or distributed transaction systems would help the reader to ground the material of this course into some practical application context; however, none is necessary.

The book is structured into nine chapters of different lengths. Some are focused on a single problem, others on a class of problems. The structuring of the written material into chapters could have easily followed different lines. For example, the material of election and of mutual exclusion could have been grouped together in a chapter on Distributed Control. Indeed, these two topics can be taught one after the other: although missing an introduction, this “hidden” chapter is present in a distributed way. An important “hidden” chapter is Chapter 10 on Distributed Graph Algorithms whose content is distributed throughout the book: Spanning-Tree Construction (Section 2.5), Depth-First Traversal (Section 2.3.1), Breadth-First Spanning Tree (Section 4.2.5), Minimum-Cost Spanning Tree (Section 3.8.1), Shortest Paths (Section 4.2.3), Centers and Medians (Section 2.6), Cycle and Knot Detection (Section 8.2).

The suggested prerequisite structure of the chapters is shown in Figure 1. As suggested by the figure, the first three chapters should be covered sequentially and before the other material. There are only two other prerequisite relationships. The relationship between Synchronous Computation (Chapter 6) and Computing in Presence of Faults (Chapter 7) is particular.

Footnote 2: An open source Java-based engine, DisJ, provides the execution and visualization environment for our reactive protocols.
The recommended sequencing is in fact the following: Sections 7.1– 7.2 (providing the strong motivation for synchronous computing), Chapter 6 (describing fault-free synchronous computing) and the rest of Chapter 7 (dealing with fault-tolerant synchronous computing as well as other issues). The other suggested

Figure 1: Prerequisite structure of the chapters.


prerequisite structure is that the topic of Stable Properties (Chapter 8) be handled before that of Continuous Computations (Chapter 9). Other than that, the sections can be mixed and matched depending on the instructor’s preferences and interests. An interesting and popular sequence for a one-semester course is given by Chapters 1–6. A more conventional one-semester sequence is provided by Chapters 1–3 and 6–9. The symbol () after a section indicates noncore material. In connection with Exercises and Problems the symbol () denotes difﬁculty (the more the symbols, the greater the difﬁculty). Several important topics are not included in this edition of the book. In particular, this edition does not include algorithms on distributed coloring, on minimal independent sets, on self-stabilization, as well as on Sense of Direction. By design, this book does not include distributed computing in the shared memory model, focusing entirely on the message-passing paradigm. This book has evolved from the teaching method and the material I have designed for the fourth-year undergraduate course Introduction to Distributed Computing and for the graduate course Principles of Distributed Computing at Carleton University over the last 20 years, and for the advanced graduate courses on Distributed Algorithms I have taught as part of the Advanced Summer School on Distributed Computing at the University of Siena over the last 10 years. I am most grateful to all the students of these courses: through their feedback they have helped me verify what works and what does not, shaping my teaching and thus the current structure of this book. Their keen interest and enthusiasm over the years have been the main reason for the existence of this book. This book is very much work in progress. I would welcome any feedback that will make it grow and mature and change. 
Comments, criticisms, and reports on personal experience as a lecturer using the book, as a student studying it, or as a researcher glancing through it, suggestions for changes, and so forth: I am looking forward to receiving any. Clearly, reports on typos, errors, and mistakes are very much appreciated. I tried to be accurate in giving credits; if you know of any omission or mistake in this regard, please let me know. My own experience as well as that of my students leads to the inescapable conclusion that distributed algorithms are fun both to teach and to learn. I welcome you to share this experience, and I hope you will reach the same conclusion.

Nicola Santoro

CHAPTER 1

Distributed Computing Environments

The universe in which we will be operating will be called a distributed computing environment. It consists of a ﬁnite collection E of computational entities communicating by means of messages. Entities communicate with other entities to achieve a common goal; for example, to perform a given task, to compute the solution to a problem, to satisfy a request either from the user (i.e., outside the environment) or from other entities. In this chapter, we will examine this universe in some detail.

1.1 ENTITIES

The computational unit of a distributed computing environment is called an entity. Depending on the system being modeled by the environment, an entity could correspond to a process, a processor, a switch, an agent, and so forth in the system.

Capabilities Each entity x ∈ E is endowed with local (i.e., private and nonshared) memory Mx. The capabilities of x include access (storage and retrieval) to local memory, local processing, and communication (preparation, transmission, and reception of messages). Local memory includes a set of defined registers whose values are always initially defined; among them are the status register (denoted by status(x)) and the input value register (denoted by value(x)). The register status(x) takes values from a finite set of system states S; examples of such values are “Idle,” “Processing,” “Waiting,” and so forth. In addition, each entity x ∈ E has available a local alarm clock cx, which it can set and reset (turn off). An entity can perform only four types of operations:

- local storage and processing
- transmission of messages
- (re)setting of the alarm clock
- changing the value of the status register

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.


Note that, although setting the alarm clock and updating the status register can be considered as a part of local processing, because of the special role these operations play, we will consider them as distinct types of operations.

External Events The behavior of an entity x ∈ E is reactive: x only responds to external stimuli, which we call external events (or just events); in the absence of stimuli, x is inert and does nothing. There are three possible external events:

- arrival of a message
- ringing of the alarm clock
- spontaneous impulse

The arrival of a message and the ringing of the alarm clock are events that are external to the entity but originate within the system: the message is sent by another entity, and the alarm clock is set by the entity itself. Unlike the other two types of events, a spontaneous impulse is triggered by forces external to the system and thus outside the universe perceived by the entity. As an example of an event generated by forces external to the system, consider an automated banking system: its entities are the bank servers where the data is stored and the automated teller machines (ATMs); the request by a customer for a cash withdrawal (i.e., update of data stored in the system) is a spontaneous impulse for the ATM (the entity) where the request is made. For another example, consider a communication subsystem in the open systems interconnection (OSI) Reference Model: the request from the network layer for a service by the data link layer (the system) is a spontaneous impulse for the data-link-layer entity where the request is made. Appearing to entities as “acts of God,” the spontaneous impulses are the events that start the computation and the communication.

Actions When an external event e occurs, an entity x ∈ E will react to e by performing a finite, indivisible, and terminating sequence of operations called an action.
An action is indivisible (or atomic) in the sense that its operations are executed without interruption; in other words, once an action starts, it will not stop until it is finished. An action is terminating in the sense that, once it is started, its execution ends within finite time. (Programs that do not terminate cannot be termed actions.) A special action that an entity may take is the null action nil, where the entity does not react to the event.

Behavior The nature of the action performed by the entity depends on the nature of the event e, as well as on which status the entity is in (i.e., the value of status(x)) when the event occurs. Thus the specification will take the form

Status × Event −→ Action,


which will be called a rule (or a method, or a production). In a rule s × e −→ A, we say that the rule is enabled by (s, e).

The behavioral specification, or simply behavior, of an entity x is the set B(x) of all the rules that x obeys. This set must be complete and nonambiguous: for every possible event e and status value s, there is one and only one rule in B(x) enabled by (s, e). In other words, x must always know exactly what it must do when an event occurs. The set of rules B(x) is also called the protocol or distributed algorithm of x.

The behavioral specification of the entire distributed computing environment is just the collection of the individual behaviors of the entities. More precisely, the collective behavior B(E) of a collection E of entities is the set B(E) = {B(x): x ∈ E}. Thus, in an environment with collective behavior B(E), each entity x will be acting (behaving) according to its distributed algorithm or protocol (set of rules) B(x).

Homogeneous Behavior A collective behavior is homogeneous if all entities in the system have the same behavior, that is, ∀x, y ∈ E, B(x) = B(y). This means that to specify a homogeneous collective behavior, it is sufficient to specify the behavior of a single entity; in this case, we will indicate the behavior simply by B. An interesting and important fact is the following:

Property 1.1.1 Every collective behavior can be made homogeneous.

This means that if we are in a system where different entities have different behaviors, we can write a new set of rules, the same for all of them, which will still make them behave as before.

Example Consider a system composed of a network of several identical workstations and a single server; clearly, the set of rules that the server and a workstation obey is not the same, as their functionality differs. Still, a single program can be written that will run on both entities without modifying their functionality.
We need to add to each entity an input register, my_role, which is initialized to either “workstation” or “server,” depending on the entity; for each status–event pair (s, e) we create a new rule with the following action:

s × e −→ { if my_role = workstation then A_workstation else A_server endif },

where A_workstation (respectively, A_server) is the original action associated to (s, e) in the set of rules of the workstation (respectively, server). If (s, e) did not enable any rule for a workstation (e.g., s was a status defined only for the server), then A_workstation = nil in the new rule; analogously for the server.

It is important to stress that in a homogeneous system, although all entities have the same behavioral description (software), they do not have to act in the same way; their difference will depend solely on the initial value of their input registers. An analogy is the legal system in democratic countries: the law (the set of rules) is the same for every citizen (entity); still, if you are in the police force, while on duty, you are allowed to perform actions that are unlawful for most of the other citizens. An important consequence of the homogeneous behavior property is that we can concentrate solely on environments where all the entities have the same behavior. From now on, when we mention behavior we will always mean homogeneous collective behavior.
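The rule format Status × Event −→ Action and the homogenization argument can be illustrated with a small sketch. It is only one possible encoding, not the book's notation: a behavior is assumed to be a Python dictionary keyed by (status, event), and the names `is_complete`, `homogenize`, and the toy workstation/server actions are hypothetical.

```python
from typing import Callable, Dict, Tuple

Action = Callable[[dict], None]            # an action works on the entity's local memory
Behavior = Dict[Tuple[str, str], Action]   # (status, event) -> action


def nil(memory: dict) -> None:
    """The null action: the entity does not react to the event."""


def is_complete(behavior: Behavior, statuses, events) -> bool:
    """Completeness: every (status, event) pair enables a rule.
    Non-ambiguity holds by construction, since a dict maps each key to one action."""
    return all((s, e) in behavior for s in statuses for e in events)


def homogenize(behaviors: Dict[str, Behavior], statuses, events) -> Behavior:
    """Build a single behavior whose actions dispatch on the my_role register."""
    def make_rule(s: str, e: str) -> Action:
        def action(memory: dict) -> None:
            role_rules = behaviors[memory["my_role"]]
            # A pair that enabled no rule for this role maps to nil.
            role_rules.get((s, e), nil)(memory)
        return action
    return {(s, e): make_rule(s, e) for s in statuses for e in events}


# Hypothetical toy behaviors for the workstation/server example.
def ws_request(memory: dict) -> None:
    memory["log"].append("workstation: send request")


def srv_reply(memory: dict) -> None:
    memory["log"].append("server: send reply")


STATUSES = ["Idle"]
EVENTS = ["impulse", "message"]
workstation = {("Idle", "impulse"): ws_request}
server = {("Idle", "message"): srv_reply}
B = homogenize({"workstation": workstation, "server": server}, STATUSES, EVENTS)

w = {"my_role": "workstation", "log": []}
s = {"my_role": "server", "log": []}
B[("Idle", "impulse")](w)   # runs the workstation's original action
B[("Idle", "impulse")](s)   # nil for the server
B[("Idle", "message")](s)   # runs the server's original action
```

Both entities execute the same rule set `B`; they behave differently only because their `my_role` registers were initialized differently.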

1.2 COMMUNICATION

In a distributed computing environment, entities communicate by transmitting and receiving messages. The message is the unit of communication of a distributed environment. In its more general definition, a message is just a finite sequence of bits. An entity communicates by transmitting messages to and receiving messages from other entities. The set of entities with which an entity can communicate directly is not necessarily E; in other words, it is possible that an entity can communicate directly only with a subset of the other entities. We denote by Nout (x) ⊆ E the set of entities to which x can transmit a message directly; we shall call them the out-neighbors of x. Similarly, we denote by Nin (x) ⊆ E the set of entities from which x can receive a message directly; we shall call them the in-neighbors of x.

The neighborhood relationship defines a directed graph $\vec{G} = (V, \vec{E})$, where V is the set of vertices and $\vec{E} \subseteq V \times V$ is the set of edges; the vertices correspond to entities, and $(x, y) \in \vec{E}$ if and only if the entity (corresponding to) y is an out-neighbor of the entity (corresponding to) x.

The directed graph $\vec{G} = (V, \vec{E})$ describes the communication topology of the environment. We shall denote by $n(\vec{G})$, $m(\vec{G})$, and $d(\vec{G})$ the number of vertices, edges, and the diameter of $\vec{G}$, respectively. When no ambiguity arises, we will omit the reference to $\vec{G}$ and use simply n, m, and d.

In the following and unless ambiguity should arise, the terms vertex, node, site, and entity will be used as having the same meaning; analogously, the terms edge, arc, and link will be used interchangeably.

In summary, an entity can only receive messages from its in-neighbors and send messages to its out-neighbors. Messages received at an entity are processed there in the order they arrive; if more than one message arrives at the same time, they will be processed in arbitrary order (see Section 1.9). Entities and communication may fail.
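As a small illustration of these definitions, the sketch below derives the in-neighbors and the quantities n, m, and d from a table of out-neighbors. The topology and the function names are hypothetical, and the diameter is assumed here to mean the largest shortest-path distance over all ordered pairs of vertices.

```python
from collections import deque

# Hypothetical topology: N_out[x] is the set of out-neighbors of x.
N_out = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"a"}}


def in_neighbors(n_out):
    """Derive N_in from N_out: y is an in-neighbor of x iff x is an out-neighbor of y."""
    n_in = {x: set() for x in n_out}
    for x, outs in n_out.items():
        for y in outs:
            n_in[y].add(x)
    return n_in


def distances_from(n_out, source):
    """Breadth-first search along directed edges; returns hop distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in n_out[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def measures(n_out):
    """Return (n, m, d); d is None when some vertex cannot reach all the others."""
    n = len(n_out)
    m = sum(len(outs) for outs in n_out.values())
    eccentricities = []
    for x in n_out:
        dist = distances_from(n_out, x)
        if len(dist) < n:
            return n, m, None
        eccentricities.append(max(dist.values()))
    return n, m, max(eccentricities)


n, m, d = measures(N_out)
```

For the table above, `measures` yields n = 4, m = 5, and d = 3 (for instance, d is three hops away from a).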

1.3 AXIOMS AND RESTRICTIONS The deﬁnition of distributed computing environment with point-to-point communication has two basic axioms, one on communication delay, and the other on the local orientation of the entities in the system.


Any additional assumption (e.g., property of the network, a priori knowledge by the entities) will be called a restriction.

1.3.1 Axioms

Communication Delays Communication of a message involves many activities: preparation, transmission, reception, and processing. In real systems described by our model, the time required by these activities is unpredictable. For example, in a communication network a message will be subject to queueing and processing delays, which change depending on the network traffic at that time; consider, for instance, the delay in accessing (i.e., sending a message to and getting a reply from) a popular web site. The totality of delays encountered by a message will be called the communication delay of that message.

Axiom 1.3.1 Finite Communication Delays
In the absence of failures, communication delays are finite.

In other words, in the absence of failures, a message sent to an out-neighbor will eventually arrive in its integrity and be processed there. Note that the Finite Communication Delays axiom does not imply the existence of any bound on transmission, queueing, or processing delays; it only states that in the absence of failure, a message will arrive after a finite amount of time without corruption.

Local Orientation An entity can communicate directly with a subset of the other entities: its neighbors. The only other axiom in the model is that an entity can distinguish between its neighbors.

Axiom 1.3.2 Local Orientation
An entity can distinguish among its in-neighbors. An entity can distinguish among its out-neighbors.

In particular, an entity is capable of sending a message only to a specific out-neighbor (without having to send it also to all other out-neighbors). Also, when processing a message (i.e., executing the rule enabled by the reception of that message), an entity can distinguish which of its in-neighbors sent that message.
In other words, each entity x has a local function lx associating labels, also called port numbers, to its incident links (or ports), and this function is injective. We denote port numbers by lx (x, y), the label associated by x to the link (x, y). Let us stress that this label is local to x and in general has no relationship at all with what y might call this link (or x, or itself). Note that for each edge $(x, y) \in \vec{E}$, there are two labels: lx (x, y) local to x and ly (x, y) local to y (see Figure 1.1).

Because of this axiom, we will always deal with edge-labeled graphs $(\vec{G}, l)$, where l = {lx : x ∈ V } is the set of these injective labelings.
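A minimal sketch of what Local Orientation requires, assuming each local labeling lx is stored as a dictionary from incident links to port labels. The three-entity example and the function names are hypothetical; the check simply tests injectivity of every local labeling.

```python
# labels[x][link] is the port label that entity x assigns to an incident link,
# where a link is written as the ordered pair (tail, head).
labels = {
    "x": {("x", "y"): 1, ("x", "z"): 2},
    "y": {("x", "y"): "left"},
    "z": {("x", "z"): 7},
}


def is_locally_oriented(labeling) -> bool:
    """Every local labeling must be injective: distinct incident links of the
    same entity never share a label."""
    return all(len(set(lx.values())) == len(lx) for lx in labeling.values())


def edge_labels(labeling, edge):
    """Return the two labels of an edge (x, y): the one local to x and the one local to y."""
    x, y = edge
    return labeling[x][edge], labeling[y][edge]


ok = is_locally_oriented(labels)
two_labels = edge_labels(labels, ("x", "y"))   # the labels are unrelated: 1 and "left"

# A labeling that violates the axiom: x cannot tell its two links apart.
ambiguous = {"x": {("x", "y"): 1, ("x", "z"): 1}}
```

Note that the two labels of the same edge need not agree, or even be of the same kind; each is meaningful only to the entity that assigned it.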

FIGURE 1.1: Every edge has two labels.

1.3.2 Restrictions

In general, a distributed computing system might have additional properties or capabilities that can be exploited to solve a problem, to achieve a task, or to provide a service. This can be achieved by using these properties and capabilities in the set of rules. However, any property used in the protocol limits the applicability of the protocol. In other words, any additional property or capability of the system is actually a restriction (or submodel) of the general model.

WARNING. When dealing with (e.g., designing, developing, testing, employing) a distributed computing system or just a protocol, it is crucial and imperative that all restrictions are made explicit. Failure to do so will invalidate the resulting communication software.

The restrictions can be varied in nature and type: they might be related to communication properties, reliability, synchrony, and so forth. In the following section, we will discuss some of the most common restrictions.

Communication Restrictions The first category of restrictions includes those relating to communication among entities.

Queueing Policy A link (x, y) can be viewed as a channel or a queue (see Section 1.9): x sending a message to y is equivalent to x inserting the message in the channel. In general, all kinds of situations are possible; for example, messages in the channel might overtake each other, and a later message might be received first. Different restrictions on the model will describe different disciplines employed to manage the channel; for example, first-in-first-out (FIFO) queues are characterized by the following restriction.

Message Ordering: In the absence of failure, the messages transmitted by an entity to the same out-neighbor will arrive in the same order they are sent.

Note that Message Ordering does not imply the existence of any ordering for messages transmitted to the same entity from different edges, nor for messages sent by the same entity on different edges.
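The difference between the general model and the Message Ordering restriction can be sketched with two toy channel classes. Both classes and the `drain` helper are hypothetical illustrations; the unordered channel uses a seeded pseudo-random choice only as a stand-in for the arbitrary overtaking that the general model allows.

```python
import random
from collections import deque


class FifoChannel:
    """A link obeying Message Ordering: messages arrive in the order they were sent."""

    def __init__(self):
        self._queue = deque()

    def send(self, message):
        self._queue.append(message)

    def deliver(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


class UnorderedChannel:
    """A link of the general model: every message arrives, but in arbitrary order."""

    def __init__(self, seed=0):
        self._pending = []
        self._rng = random.Random(seed)

    def send(self, message):
        self._pending.append(message)

    def deliver(self):
        index = self._rng.randrange(len(self._pending))
        return self._pending.pop(index)

    def __len__(self):
        return len(self._pending)


def drain(channel, messages):
    """Send all messages on one link, then deliver until the channel is empty."""
    for message in messages:
        channel.send(message)
    return [channel.deliver() for _ in range(len(channel))]


fifo_arrivals = drain(FifoChannel(), [1, 2, 3, 4])
other_arrivals = drain(UnorderedChannel(), [1, 2, 3, 4])
```

In both cases every message is delivered (finite delays, no loss); only the FIFO channel guarantees that the arrival order equals the sending order, and the guarantee concerns a single link only.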
Link Property Entities in a communication system are connected by physical links, which may be very different in capabilities. Examples are simplex and full-duplex links. With a full-duplex line it is possible to transmit in both directions. Simplex lines are already defined within the general model. A duplex line can obviously be described as two simplex lines, one in each direction; thus, a system where all lines are full duplex can be described by the following restriction:

Reciprocal communication: ∀x ∈ E, Nin (x) = Nout (x). In other words, if $(x, y) \in \vec{E}$ then also $(y, x) \in \vec{E}$.

Notice that, however, (x, y) ≠ (y, x), and in general lx (x, y) ≠ lx (y, x); furthermore, x might not know that these two links are connections to and from the same entity. A system with full-duplex links that offers such knowledge is defined by the following restriction.

Bidirectional links: ∀x ∈ E, Nin (x) = Nout (x) and lx (x, y) = lx (y, x).

IMPORTANT. The case of Bidirectional Links is special. If it holds, we use a simplified terminology. The network is viewed as an undirected graph G = (V, E) (i.e., ∀ x, y ∈ E, (x, y) = (y, x)), and the set N(x) = Nin (x) = Nout (x) will just be called the set of neighbors of x. Note that in this case, $m(\vec{G}) = |\vec{E}| = 2\,|E| = 2\,m(G)$. For example, in Figure 1.2 a graph $\vec{G}$ is depicted where the Bidirectional Links restriction holds, together with the corresponding undirected graph G.

Reliability Restrictions Other types of restrictions are those related to reliability, faults, and their detection.

FIGURE 1.2: In a network with Bidirectional Links we consider the corresponding undirected graph.
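The passage from the directed topology to the undirected graph under Bidirectional Links can be sketched as follows. The triangle on X, Y, Z, the label tables, and the function names are hypothetical; `out_label[x][y]` is assumed to hold lx(x, y) and `in_label[x][y]` to hold lx(y, x).

```python
def is_reciprocal(n_out) -> bool:
    """Reciprocal communication: whenever (x, y) is an edge, so is (y, x)."""
    return all(x in n_out[y] for x, outs in n_out.items() for y in outs)


def is_bidirectional(n_out, out_label, in_label) -> bool:
    """Bidirectional links: reciprocity, plus each entity uses the same label
    for the link to a neighbor and the link from that neighbor."""
    return is_reciprocal(n_out) and all(
        out_label[x][y] == in_label[x][y] for x in n_out for y in n_out[x]
    )


def undirected_edges(n_out):
    """Collapse each pair of opposite directed edges into one undirected edge.
    Assumes no self-loops."""
    if not is_reciprocal(n_out):
        raise ValueError("topology is not reciprocal")
    return {frozenset((x, y)) for x, outs in n_out.items() for y in outs}


# Hypothetical triangle on three entities.
triangle = {"X": {"Y", "Z"}, "Y": {"X", "Z"}, "Z": {"X", "Y"}}
out_label = {"X": {"Y": "b", "Z": "c"}, "Y": {"X": "b", "Z": "c"}, "Z": {"X": "c", "Y": "b"}}
in_label = {x: dict(ports) for x, ports in out_label.items()}   # same label both ways

m_directed = sum(len(outs) for outs in triangle.values())
m_undirected = len(undirected_edges(triangle))
```

Here `m_directed` is 6 and `m_undirected` is 3, matching the relation that the directed graph has twice as many edges as the undirected one.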


Detection of Faults Some systems might provide a reliable fault-detection mechanism. Following are two restrictions that describe systems that offer such capabilities in regard to component failures:

Edge failure detection: $\forall (x, y) \in \vec{E}$, both x and y will detect whether (x, y) has failed and, following its failure, whether it has been reactivated.

Entity failure detection: ∀x ∈ V, all in- and out-neighbors of x can detect whether x has failed and, following its failure, whether it has recovered.

Restricted Types of Faults In some systems only some types of failures can occur: for example, messages can be lost but not corrupted. Each situation will give rise to a corresponding restriction. More general restrictions will describe systems or situations where there will be no failures:

Guaranteed delivery: Any message that is sent will be received with its content uncorrupted.

Under this restriction, protocols do not need to take into account omissions or corruptions of messages during transmission. Even more general is the following:

Partial reliability: No failures will occur.

Under this restriction, protocols do not need to take failures into account. Note that under Partial Reliability, failures might have occurred before the execution of a computation. A totally fault-free system is defined by the following restriction.

Total reliability: Neither have any failures occurred nor will they occur.

Clearly, protocols developed under this restriction are not guaranteed to work correctly if faults occur.

Topological Restrictions In general, an entity is not directly connected to all other entities; it might still be able to communicate information to a remote entity, using others as relayers. A system that provides this capability for all entities is characterized by the following restriction:

Connectivity: The communication topology $\vec{G}$ is strongly connected.

That is, from every vertex in $\vec{G}$ it is possible to reach every other vertex. In case the restriction “Bidirectional Links” holds as well, connectedness will simply state that G is connected.
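The Connectivity restriction can be checked directly from its definition. The sketch below is a deliberately simple (not the most efficient) test, with hypothetical function names and example topologies: it verifies that every vertex reaches every other vertex along directed edges.

```python
def reachable_from(n_out, source):
    """Set of vertices reachable from source along directed edges (including source)."""
    seen = {source}
    stack = [source]
    while stack:
        x = stack.pop()
        for y in n_out[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def is_strongly_connected(n_out) -> bool:
    """Connectivity: from every vertex it is possible to reach every other vertex."""
    vertices = set(n_out)
    return all(reachable_from(n_out, x) == vertices for x in n_out)


# Hypothetical examples: a directed ring satisfies the restriction, a directed chain does not.
ring = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
chain = {"a": {"b"}, "b": {"c"}, "c": set()}

ring_ok = is_strongly_connected(ring)
chain_ok = is_strongly_connected(chain)
```

In the chain, c can receive from everybody but can relay information to nobody, so a computation started at c could never involve the other entities.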


Time Restrictions An interesting type of restriction is the one relating to time. In fact, the general model makes no assumption about delays (except that they are finite).

Bounded communication delays: There exists a constant Δ such that, in the absence of failures, the communication delay of any message on any link is at most Δ.

A special case of bounded delays is the following:

Unitary communication delays: In the absence of failures, the communication delay of any message on any link is one unit of time.

The general model also makes no assumptions about the local clocks.

Synchronized clocks: All local clocks are incremented by one unit simultaneously and the interval of time between successive increments is constant.

1.4 COST AND COMPLEXITY

The computing environment we are considering is defined at an abstract level. It models rather different systems (e.g., communication networks, distributed systems, data networks, etc.), whose performance is determined by very distinctive factors and costs. The efficiency of a protocol in the model must somehow reflect the realistic costs encountered when executed in those very different systems. In other words, we need abstract cost measures that are general enough but still meaningful. We will use two types of measures: the amount of communication activities and the time required by the execution of a computation. They can be seen as measuring costs from the system point of view (how much traffic will this computation generate and how busy will the system be?) and from the user point of view (how long will it take before I get the results of the computation?).

1.4.1 Amount of Communication Activities

The transmission of a message through an out-port (i.e., to an out-neighbor) is the basic communication activity in the system; note that the transmission of a message that will not be received because of failure still constitutes a communication activity.
Thus, to measure the amount of communication activities, the most common function used is the number of message transmissions M, also called message cost. So in general, given a protocol, we will measure its communication costs in terms of the number of transmitted messages. Other functions of interest are the entity workload Lnode = M/|V |, that is, the number of messages per entity, and the transmission load Llink = M/|E|, that is, the number of messages per link.
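As a worked illustration of these measures, the sketch below computes M, the entity workload, and the transmission load from a table of per-link transmission counts. The counts, the function name, and the dictionary layout are hypothetical; the table is assumed to list every link of the topology, including links on which nothing was sent.

```python
def load_measures(sent_per_link, n_entities):
    """sent_per_link maps each directed link (x, y) to the number of messages
    transmitted on it; returns M, L_node = M/|V| and L_link = M/|E|."""
    total = sum(sent_per_link.values())
    return {
        "M": total,
        "L_node": total / n_entities,
        "L_link": total / len(sent_per_link),
    }


# Hypothetical execution on a topology with 4 entities and 5 links.
sent = {
    ("a", "b"): 2,
    ("b", "c"): 1,
    ("c", "a"): 0,   # an idle link still counts in |E|
    ("c", "d"): 3,
    ("d", "a"): 4,
}
costs = load_measures(sent, n_entities=4)
```

With these counts, M = 10, the entity workload is 10/4 = 2.5 messages per entity, and the transmission load is 10/5 = 2 messages per link.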


Messages are sequences of bits; some protocols might employ messages that are very short (e.g., O(1) bit signals), others very long (e.g., .gif files). Thus, for a more accurate assessment of a protocol, or to compare different solutions to the same problem that use different sizes of messages, it might be necessary to use as a cost measure the number of transmitted bits B, also called bit complexity. In this case, we may sometimes consider the bit-defined load functions: the entity bit-workload Lbnode = B/|V |, that is, the number of bits per entity, and the transmission bit-load Lblink = B/|E|, that is, the number of bits per link.

1.4.2 Time

An important measure of efficiency and complexity is the total execution delay, that is, the delay between the time the first entity starts the execution of a computation and the time the last entity terminates its execution. Note that “time” is here intended as the one measured by an observer external to the system and will also be called real or physical time. In the general model there is no assumption about time except that communication delays for a single message are finite in absence of failure (Axiom 1.3.1). In other words, communication delays are in general unpredictable. Thus, even in the absence of failures, the total execution delay for a computation is totally unpredictable; furthermore, two distinct executions of the same protocol might experience drastically different delays. In other words, we cannot accurately measure time. We can, however, measure time assuming particular conditions. The measure usually employed is the ideal execution delay or ideal time complexity, T: the execution delay experienced under the restrictions “Unitary Communication Delays” and “Synchronized Clocks”; that is, when the system is synchronous and (in the absence of failure) it takes one unit of time for a message to arrive and to be processed. A very different cost measure is the causal time complexity, Tcausal.
It is deﬁned as the length of the longest chain of causally related message transmissions, over all possible executions. Causal time is seldom used and is very difﬁcult to measure exactly; we will employ it only once, when dealing with synchronous computations.

1.5 AN EXAMPLE: BROADCASTING

Let us clarify the concepts expressed so far by means of an example. Consider a distributed computing system where one entity has some important information unknown to the others and would like to share it with everybody else. This problem is called broadcasting and it is part of a general class of problems called information diffusion. To solve this problem means to design a set of rules that, when executed by the entities, will lead (within finite time) to all entities knowing the information; the solution must work regardless of which entity had the information at the beginning. Let E be the collection of entities and G be the communication topology.

To simplify the discussion, we will make some additional assumptions (i.e., restrictions) on the system:

1. Bidirectional links; that is, we consider the undirected graph G (see Section 1.3.2).
2. Total reliability; that is, we do not have to worry about failures.

Observe that, if G is disconnected, some entities can never receive the information, and the broadcasting problem will be unsolvable. Thus, a restriction that (unlike the previous two) we need to make is as follows:

3. Connectivity; that is, G is connected.

Further observe that, built into the definition of the problem, there is the assumption that only the entity with the initial information will start the broadcast. Thus, a restriction built into the definition is as follows:

4. Unique Initiator; that is, only one entity will start.

A simple strategy for solving the broadcast problem is the following: “if an entity knows the information, it will share it with its neighbors.” To construct the set of rules implementing this strategy, we need to define the set S of status values; from the statement of the problem it is clear that we need to distinguish between the entity that initially has the information and the others: {initiator, idle} ⊆ S. The process can be started only by the initiator; let I denote the information to be broadcast. Here is the set of rules B(x) (the same for all entities):

1. initiator × ι → {send(I) to N(x)}
2. idle × Receiving(I) → {Process(I); send(I) to N(x)}
3. initiator × Receiving(I) → nil
4. idle × ι → nil

where ι denotes the spontaneous impulse event and nil denotes the null action. Because of connectivity and total reliability, every entity will eventually receive the information. Hence, the protocol achieves its goal and solves the broadcasting problem. However, there is a serious problem with these rules: the activities generated by the protocol never terminate. Consider, for example, the simple system with three entities x, y, z connected to each other (see Figure 1.3). Let x be the initiator, y and z be idle, and all messages travel at the same speed; then y and z will be forever sending messages to each other (as well as to x).
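The non-termination can be observed in a small simulation. The following Python sketch is only an illustration, assuming unit transmission delays and a dictionary `neighbors` that maps each entity to the list of its neighbors; the function name `naive_broadcast_rounds` is hypothetical and not part of the protocol.

```python
def naive_broadcast_rounds(neighbors, initiator, rounds):
    """Run the four rules above (no 'done' status) under unit delays.

    Returns the number of messages in transit at the start and after each round.
    """
    # Rule 1: on its spontaneous impulse, the initiator sends I to all its neighbors.
    in_transit = [(initiator, y) for y in neighbors[initiator]]
    history = [len(in_transit)]
    for _ in range(rounds):
        next_transit = []
        for _sender, receiver in in_transit:
            # Rule 3: the initiator performs the nil action on reception.
            if receiver == initiator:
                continue
            # Rule 2: an idle entity sends I to all its neighbors at every reception.
            next_transit.extend((receiver, y) for y in neighbors[receiver])
        in_transit = next_transit
        history.append(len(in_transit))
    return history


# Three entities connected to each other, with x as the initiator.
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
history = naive_broadcast_rounds(triangle, "x", 6)
print(history)
```

In this run the number of messages in transit never drops to zero, however many rounds are simulated.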

FIGURE 1.3: An execution of Flooding.

To avoid this unwelcome effect, an entity should send the information to its neighbors only once: the first time it acquires the information. This can be achieved by introducing a new status done; that is, S = {initiator, idle, done}.

1. initiator × ι → {send(I) to N(x); become done}
2. idle × Receiving(I) → {Process(I); become done; send(I) to N(x)}
3. initiator × Receiving(I) → nil
4. idle × ι → nil
5. done × Receiving(I) → nil
6. done × ι → nil

where become denotes the operation of changing status. This time the communication activities of the protocol terminate: within finite time all entities become done; since a done entity knows the information, the protocol is correct (see Exercise 1.12.1). Note that, depending on transmission delays, different executions are possible; one such execution, in an environment composed of three entities x, y, z connected to each other, where x is the initiator, is depicted in Figure 1.3.

IMPORTANT. Note that entities terminate their execution of the protocol (i.e., become done) at different times; it is actually possible that an entity has terminated while others have not yet started. This is something very typical of distributed computations: there is a difference between local termination and global termination.

IMPORTANT. Notice also that in this protocol nobody ever knows when the entire process is over. We will examine these issues in detail in other chapters, in particular when discussing the problem of termination detection.

The above set of rules correctly solves the problem of broadcasting. Let us now calculate the communication costs of the algorithm. First of all, let us determine the number of message transmissions. Each entity, whether initiator or not, sends the information to all its neighbors. Hence the total number of messages transmitted is exactly

$$\sum_{x \in E} |N(x)| = 2\,|E| = 2m,$$

where the sum ranges over all entities and m is the number of links.
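The count can be checked on a small graph. The sketch below is a minimal illustration, assuming the topology is given as a list of undirected links; the helper names `neighbors_from_links` and `messages_sent` are hypothetical.

```python
def neighbors_from_links(links):
    """Build the neighbor sets N(x) from a list of undirected links."""
    neighbors = {}
    for x, y in links:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    return neighbors


def messages_sent(neighbors):
    """Each entity sends I once to every neighbor: the sum over x of |N(x)|."""
    return sum(len(nbrs) for nbrs in neighbors.values())


# A triangle x, y, z with one extra entity w attached to z: m = 4 links.
links = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
neighbors = neighbors_from_links(links)
total = messages_sent(neighbors)
print(total, 2 * len(links))
```

Every link contributes to the neighbor sets of both of its endpoints, which is why the sum equals twice the number of links.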

We can actually reduce the cost. Currently, when an idle entity receives the message, it will broadcast the information to all its neighbors, including the entity from which it had received the information; this is clearly unnecessary. Recall that, by the Local Orientation axiom, an entity can distinguish among its neighbors; in particular, when processing a message, it can identify from which port it was received and avoid sending a message there. The final protocol is as before with only this small modification.

Protocol Flooding

1. initiator × ι → {send(I) to N(x); become done}
2. idle × Receiving(I) → {Process(I); become done; send(I) to N(x) − {sender}}
3. initiator × Receiving(I) → nil
4. idle × ι → nil
5. done × Receiving(I) → nil
6. done × ι → nil

where sender is the neighbor that sent the message currently being processed. This algorithm is called Flooding as the entire system is “flooded” with the message during its execution, and it is a basic algorithmic tool for distributed computing. As for the number of message transmissions required by flooding, because we avoid transmitting some messages, we know that it is less than 2m; in fact (Exercise 1.12.2):

$$M[\textit{Flooding}] = 2m - n + 1. \tag{1.1}$$
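Equation (1.1) can be checked experimentally. The following Python sketch simulates the six rules of Protocol Flooding; it is an illustration under assumptions that are not part of the protocol: messages are delivered in first-in, first-out order from a single queue, and the names `flooding` and `in_transit` are hypothetical. The message count does not depend on the delivery order, since the initiator sends one message per neighbor and every other entity sends exactly once, to all neighbors but the sender.

```python
from collections import deque


def flooding(neighbors, initiator):
    """Simulate Protocol Flooding; return the final statuses and the message count."""
    status = {x: "idle" for x in neighbors}
    status[initiator] = "initiator"
    in_transit = deque()
    sent = 0

    # Rule 1: initiator x Spontaneously -> send(I) to N(x); become done.
    for y in neighbors[initiator]:
        in_transit.append((initiator, y))
        sent += 1
    status[initiator] = "done"

    while in_transit:
        sender, x = in_transit.popleft()
        # Rules 3 and 5: any non-idle entity performs the nil action on reception.
        if status[x] != "idle":
            continue
        # Rule 2: idle x Receiving(I) -> send(I) to N(x) - {sender}; become done.
        for y in neighbors[x]:
            if y != sender:
                in_transit.append((x, y))
                sent += 1
        status[x] = "done"
    return status, sent


triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}          # n = 3, m = 3
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}  # n = 4, m = 4
print(flooding(triangle, "x")[1], flooding(ring, "a")[1])
```

For the triangle the simulation sends 2·3 − 3 + 1 = 4 messages, and for the ring of four entities 2·4 − 4 + 1 = 5.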

Let us now examine the ideal time complexity of flooding. Let d(x, y) denote the distance (i.e., the length of the shortest path) between x and y in G. Clearly the message sent by the initiator has to reach every entity in the system, including the furthermost one from the initiator. So, if x is the initiator, the ideal time complexity will be $r(x) = \max\{d(x, y) : y \in E\}$, which is called the eccentricity (or radius) of x. In other words, the total time depends on which entity is the initiator and thus cannot be known precisely beforehand. We can, however, determine exactly the ideal time complexity in the worst case. Since any entity could be the initiator, the ideal time complexity in the worst case will be $d(G) = \max\{r(x) : x \in E\}$, which is the diameter of G. In other words, the ideal time complexity will be at most the diameter of G:

$$T[\textit{Flooding}] \le d(G). \tag{1.2}$$
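Both quantities can be computed by breadth-first search. The sketch below is a minimal illustration, assuming a connected undirected graph given as a dictionary of neighbor lists; the function names `eccentricity` and `diameter` are hypothetical.

```python
from collections import deque


def eccentricity(neighbors, x):
    """r(x): the largest distance d(x, y) from x, found by breadth-first search."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(neighbors):
    """d(G): the largest eccentricity over all entities."""
    return max(eccentricity(neighbors, x) for x in neighbors)


# A path a - b - c - d.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(eccentricity(path, "a"), eccentricity(path, "b"), diameter(path))
```

On this path, flooding started at the endpoint a has ideal time 3, while flooding started at b has ideal time 2; the worst case over all initiators is the diameter, 3.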

1.6 STATES AND EVENTS

Once we have defined the behavior of the entities, their communication topology, and the set of restrictions under which they operate, we must describe the initial conditions of our environment. This is done first of all by specifying the initial condition of all the entities. The initial content of all the registers of entity x and the initial value of its alarm clock $c_x$ at time t = 0 constitute the initial internal state σ(x, 0) of x. Let Σ(0) = {σ(x, 0) : x ∈ E} denote the set of all the initial internal states. Once Σ(0) is defined, we have completed the static specification of the environment: the description of the system before any event occurs and before any activity takes place.

We are, however, also interested in describing the system during the computational activities, as well as after such activities. To do so, we need to be able to describe the changes that the system undergoes over time. As mentioned before, the entities (and thus the environments) are reactive; that is, any activity of the system is determined entirely by the external events. Let us examine these facts in more detail.

1.6.1 Time and Events

In distributed computing environments, there are only three types of external events: spontaneous impulse (spontaneously), reception of a message (receiving), and alarm clock ring (when). When an external event occurs at an entity, it triggers the execution of an action (the nature of the action depends on the status of the entity when the event occurs). The executed action may generate new events: the operation send will generate a receiving event, and the operation set alarm will generate a when event.

Note first of all that the events so generated might not occur at all. For example, a link failure may destroy the traveling message, destroying the corresponding receiving event; in a subsequent action, an entity may turn off the previously set alarm, destroying the when event.

Notice now that, if they occur, these events will do so at a later time (i.e., when the message arrives, when the alarm goes off). This delay might be known precisely in the case of the alarm clock (because it is set by the entity); it is, however, unpredictable in the case of message transmission (because it is due to conditions external to the entity). Different delays give rise to different executions of the same protocol with possibly different outcomes.

Summarizing, each event e is “generated” at some time t(e) and, if it occurs, it will happen at some later time. By definition, all spontaneous impulses are already generated before the execution starts; their set will be called the set of initial events. The execution of the protocol starts when the first spontaneous impulses actually happen; by convention, this will be time t = 0.

IMPORTANT. Notice that “time” is here considered as seen by an external observer and is viewed as real time. Each real time instant t separates the axis of time into three parts: past (i.e., {t' < t}), present (i.e., t), and future (i.e., {t' > t}). All events generated before t that will happen after t are called the future at t and denoted by Future(t); it represents the set of future events determined by the execution so far.

An execution is fully described by the sequence of events that have occurred. For small systems, an execution can be visualized by what is called a Time × Event Diagram (TED). Such a diagram is composed of temporal lines, one for each entity in the system. Each event is represented in such a diagram as follows:

A Receiving event r is represented as an arrow from the point $t_x(r)$ in the temporal line of the entity x generating r (i.e., sending the message) to the point $t_y(r)$ in the temporal line of the entity y where the event occurs (i.e., receiving the message).

A When event w is represented as an arrow from point $t'_x(w)$ to point $t_x(w)$ in the temporal line of the entity x setting the clock.

A Spontaneously event ι is represented as a short arrow indicating point $t_x(ι)$ in the temporal line of the entity x where the event occurs.
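The set Future(t) can be made concrete with a small sketch. The Python fragment below is an illustration only: the `Event` record, its field names, and the treatment of the boundary instants (an event generated exactly at t counts as generated, one occurring exactly at t no longer counts as future) are assumptions, and every event is assumed to occur eventually.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str         # "Spontaneously", "Receiving", or "When"
    entity: str       # the entity at which the event will occur
    generated: float  # the time t(e) at which the event was generated
    occurs: float     # the later time at which the event happens


def future(events, t):
    """Events already generated at time t that have not yet happened."""
    return [e for e in events if e.generated <= t < e.occurs]


events = [
    Event("Receiving", "y", generated=0.0, occurs=1.0),  # x sends to y
    Event("Receiving", "z", generated=0.0, occurs=3.0),  # x sends to z (slow link)
    Event("Receiving", "z", generated=1.0, occurs=2.0),  # y forwards to z
]
print(len(future(events, 0.5)), len(future(events, 1.5)), len(future(events, 3.0)))
```

At time 1.5 the message from x to y has already arrived, so only the two messages traveling toward z belong to the future; at time 3.0 nothing is pending.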

For example, Figure 1.4 depicts the TED corresponding to the execution of Protocol Flooding shown in Figure 1.3.

FIGURE 1.4: Time × Event Diagram

1.6.2 States and Configurations

The private memory of each entity, in addition to the behavior, contains a set of registers, some of them already initialized, others to be initialized during the execution. The content of all the registers of entity x and the value of its alarm clock $c_x$ at time t constitute what is called the internal state of x at t, denoted by σ(x, t). We denote by Σ(t) the set of the internal states at time t of all entities. Internal states change with time and the occurrence of events.

There is an important fact about internal states. Consider two different environments, E1 and E2, where, by accident, the internal state of x at time t is the same. Then x cannot distinguish between the two environments; that is, x is unable to tell whether it is in environment E1 or E2.

There is an important consequence. Consider the situation just described: at time t, the internal state of x is the same in both E1 and E2. Assume now that, also by accident, exactly the same event occurs at x (e.g., the alarm clock rings or the same message is received from the same neighbor). Then x will perform exactly the same action in both cases, and its internal state will continue to be the same in both situations.

Property 1.6.1 Let the same event occur at x at time t in two different executions, and let σ1 and σ2 be its internal states when this happens. If σ1 = σ2, then the new internal state of x will be the same in both executions.

Similarly, if two entities have the same internal state, they cannot distinguish between each other. Furthermore, if by accident exactly the same event occurs at both of them (e.g., the alarm clock rings or the same message is received from the same neighbor), then they will perform exactly the same action in both cases, and their internal state will continue to be the same in both situations.

Property 1.6.2 Let the same event occur at x and y at time t, and let σ1 and σ2 be their internal states, respectively, at that time. If σ1 = σ2, then the new internal state of x and y will be the same.

Remember: internal states are local, and an entity might not be able to infer from them information about the status of the rest of the system.

We have talked about the internal state of an entity, initially (i.e., at time t = 0) and during an execution. Let us now focus on the state of the entire system during an execution. To describe the global state of the environment at time t, we obviously need to specify the internal state of all entities at that time, that is, the set Σ(t). However, this is not enough. In fact, the execution so far might have already generated some events that will occur after time t; these events, represented by the set Future(t), are an integral part of this execution and must be specified as well. Specifically, the global state, called configuration, of the system during an execution is specified by the couple

$$C(t) = (\Sigma(t), \mathit{Future}(t)).$$

The initial configuration C(0) contains not only the initial set of states Σ(0) but also the set Future(0) of the spontaneous impulses. Environments that differ only in their initial configuration will be called instances of the same system. The configuration C(t) is like a snapshot of the system at time t.

1.7 PROBLEMS AND SOLUTIONS

The topic of this book is how to design distributed algorithms and analyze their complexity. A distributed algorithm is the set of rules that will regulate the behaviors of the entities. The reason why we may need to design the behaviors is to enable the entities to solve a given problem, perform a defined task, or provide a requested service. In general, we will be given a problem, and our task is to design a set of rules that will always solve the problem in finite time. Let us discuss these concepts in some detail.

Problems

To give a problem (or task, or service) P means to give a description of what the entities must accomplish. This is done by stating what the initial conditions of the entities are (and thus of the system), and what the final conditions should be; it should also specify all given restrictions. In other words,

$$P = \langle P_{INIT}, P_{FINAL}, R \rangle,$$

where $P_{INIT}$ and $P_{FINAL}$ are predicates on the values of the registers of the entities, and R is a set of restrictions. Let $w_t(x)$ denote the value of an input register w(x) at time t and $\{w_t\} = \{w_t(x) : x \in E\}$ the values of this register at all entities at that time. So, for example, $\{status_0\}$ represents the initial value of the status registers of the entities.

For example, in the problem Broadcasting(I) described in Section 1.5, the initial and final conditions are given by the predicates

$P_{INIT}(t)$ ≡ “only one entity has the information at time t” ≡ $\exists x \in E\ (value_t(x) = I \wedge \forall y \neq x\ (value_t(y) = ø))$,

$P_{FINAL}(t)$ ≡ “every entity has the information at time t” ≡ $\forall x \in E\ (value_t(x) = I)$.

The restrictions we have imposed on our solution are BL (Bidirectional Links), TR (Total Reliability), and CN (Connectivity). Implicit in the problem definition there is also the condition that only the entity with the information will start the execution of the solution protocol; denote by UI the predicate describing this restriction, called Unique Initiator. Summarizing, for Broadcasting, the set of restrictions we have made is {BL, TR, CN, UI}.
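The two predicates for Broadcasting translate directly into code. The sketch below is an illustration under assumptions: the value registers at a given time are represented by a dictionary from entity names to values, the empty value ø is represented by `None`, and the function names `p_init` and `p_final` are hypothetical.

```python
EMPTY = None  # stands for the empty value ø


def p_init(value, info):
    """P_INIT: exactly one entity holds info and every other entity holds ø."""
    return any(
        value[x] == info and all(value[y] is EMPTY for y in value if y != x)
        for x in value
    )


def p_final(value, info):
    """P_FINAL: every entity holds info."""
    return all(v == info for v in value.values())


start = {"x": "I", "y": EMPTY, "z": EMPTY}
middle = {"x": "I", "y": "I", "z": EMPTY}
end = {"x": "I", "y": "I", "z": "I"}
print(p_init(start, "I"), p_init(middle, "I"), p_final(middle, "I"), p_final(end, "I"))
```

The intermediate assignment `middle` satisfies neither predicate: more than one entity already has the information, but not all of them do.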

Status

A solution protocol B for $P = \langle P_{INIT}, P_{FINAL}, R \rangle$ will specify how the entities will accomplish the required task. Part of the design of the set of rules B(x) is the definition of the set of status values S, that is, the values that can be held by the status register status(x).

We call initial status values those values of S that can be held at the start of the execution of B(x), and we shall denote their set by $S_{INIT}$. By contrast, terminal status values are those values that, once reached, cannot ever be changed by the protocol; their set shall be denoted by $S_{TERM}$. All other values in S will be called intermediate status values. For example, in the protocol Flooding described in Section 1.5, $S_{INIT}$ = {initiator, idle} and $S_{TERM}$ = {done}.

Depending on the restrictions of the problem, only entities in specific initial status values will start the protocol; we shall denote by $S_{START} \subseteq S_{INIT}$ the set of those status values. Typically, $S_{START}$ consists of only one status; for example, in Flooding, $S_{START}$ = {initiator}. It is possible to rewrite a protocol so that this is always the case (see Exercise 1.12.5).

Among terminal status values we shall distinguish those in which no further activity can take place, that is, those where the only action is nil. We shall call such status values final, and we shall denote by $S_{FINAL} \subseteq S_{TERM}$ the set of those status values. For example, in Flooding, $S_{FINAL}$ = {done}.

Termination

Protocol B terminates if, for all initial configurations C(0) satisfying $P_{INIT}$, and for all executions starting from those configurations, the predicate

Terminate(t) ≡ ($\{status_t\} \subseteq S_{TERM}$) ∧ (Future(t) = ∅)

holds for some t > 0; that is, all entities enter a terminal status after a finite time and all generated events have occurred.

We have already remarked on the fact that entities might not be aware that termination has occurred. In general, we would like each entity to know at least of its own termination. This situation, called explicit termination, is said to occur if the predicate

Explicit-Terminate(t) ≡ ($\{status_t\} \subseteq S_{FINAL}$)

holds for some t > 0; that is, all entities enter a final status after a finite time.

Correctness

Protocol B is correct if, for all executions starting from initial configurations satisfying $P_{INIT}$,

∃t > 0 : Correct(t)

holds, where Correct(t) ≡ (∀t' ≥ t, $P_{FINAL}(t')$); that is, the final predicate eventually holds and does not change.

Solution Protocol

The set of rules B solves problem P if it always correctly terminates under the problem restrictions R. As there are two types of termination (simple and explicit), we will have two types of solutions:

Simple Solution[B, P], where the predicate ∃t > 0 (Correct(t) ∧ Terminate(t)) holds, under the problem restrictions R, for all executions starting from initial configurations satisfying $P_{INIT}$; and

Explicit Solution[B, P], where the predicate ∃t > 0 (Correct(t) ∧ Explicit-Terminate(t)) holds, under the problem restrictions R, for all executions starting from initial configurations satisfying $P_{INIT}$.
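The predicates Terminate, Explicit-Terminate, and Correct can be evaluated on a recorded execution. The sketch below is a minimal illustration for Flooding, under assumptions that are not part of the definitions: time is discretized into the indices of a finite list `trace` of recorded register values, statuses are held in a dictionary, and all function names are hypothetical.

```python
S_TERM = {"done"}   # terminal status values of Flooding
S_FINAL = {"done"}  # final status values of Flooding


def terminate(status, future_events):
    """Terminate(t): all statuses are terminal and Future(t) is empty."""
    return set(status.values()) <= S_TERM and len(future_events) == 0


def explicit_terminate(status):
    """Explicit-Terminate(t): all statuses are final."""
    return set(status.values()) <= S_FINAL


def correct(trace, t, p_final):
    """Correct(t) on a finite trace: P_FINAL holds at every recorded t' >= t."""
    return all(p_final(values) for values in trace[t:])


def everyone_has_info(values):
    return all(v == "I" for v in values.values())


all_done = {"x": "done", "y": "done", "z": "done"}
pending = [("Receiving", "z")]  # a message still in transit
trace = [{"x": "I", "y": None}, {"x": "I", "y": "I"}, {"x": "I", "y": "I"}]
print(terminate(all_done, pending), terminate(all_done, []), correct(trace, 1, everyone_has_info))
```

The first call shows why Future(t) = ∅ is part of the definition: even when every entity is done, the execution is not over while a message is still traveling.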

1.8 KNOWLEDGE

The notions of information and knowledge are fundamental in distributed computing. Informally, any distributed computation can be viewed as the process of acquiring information through communication activities; conversely, the reception of a message can be viewed as the process of transforming the state of knowledge of the processor receiving the message.

1.8.1 Levels of Knowledge

The content of the local memory of an entity and the information that can be derived from it constitute the local knowledge of an entity. We denote by $p \in LK_t[x]$ the fact that p is local knowledge at x at the global time instant t. By definition, $l_x \in LK_t[x]$ for all t; that is, the (labels of the) in- and out-edges of x are time-invariant local knowledge of x.

Sometimes it is necessary to describe knowledge held by more than one entity at a given time. Information p is said to be implicit knowledge in W ⊆ E at time t, denoted by $p \in IK_t[W]$, if at least one entity in W knows p at time t, that is,

$p \in IK_t[W]$ iff $\exists x \in W\ (p \in LK_t[x])$.

A stronger level of knowledge in a group W of entities is held when, at a given time t, p is known to every entity in the group, denoted by $p \in EK_t[W]$, that is,

$p \in EK_t[W]$ iff $\forall x \in W\ (p \in LK_t[x])$.

In this case, p is said to be explicit knowledge in W ⊆ E at time t.

Consider, for example, the broadcasting problem discussed in Section 1.5. Initially, at time t = 0, only the initiator s knows the information I; in other words, $I \in LK_0[s]$. Thus, at that time, I is implicit knowledge in E, that is, $I \in IK_0[E]$. At the end of the broadcast, at time t', every entity will know the information; in other words, $I \in EK_{t'}[E]$.

Notice that, in the absence of failures, knowledge cannot be lost, only gained; that is, for all t' > t and all W ⊆ E, if no failure occurs, $IK_t[W] \subseteq IK_{t'}[W]$ and $EK_t[W] \subseteq EK_{t'}[W]$.

Assume that a fact p is explicit knowledge in W at time t. It is possible that some (maybe all) entities are not aware of this situation. For example, assume that at time t, entities x and y know the value of a variable of z, say its ID; then the ID of z is explicit knowledge in W = {x, y, z}; however, z might not be aware that x and y know its ID. In other words, when $p \in EK_t[W]$, the fact “$p \in EK_t[W]$” might not even be locally known to any of the entities in W.

This gives rise to the highest level of knowledge within a group: common knowledge. Information p is said to be common knowledge in W ⊆ E at time t, denoted by $p \in CK_t[W]$, if and only if at time t every entity in W knows p, and knows that every entity in W knows p, and knows that every entity in W knows that every entity in W knows p, and so on; that is,

$$p \in CK_t[W] \iff \bigwedge_{1 \le i \le \infty} P_i,$$

where the $P_i$'s are the predicates defined by $P_1 = [p \in EK_t[W]]$ and $P_{i+1} = [P_i \in EK_t[W]]$.

In most distributed problems, it will be necessary for the entities to achieve common knowledge. Fortunately, we do not always have to go to ∞ to reach common knowledge, and a finite number of steps might actually do, as indicated by the following example.

Example (muddy forehead): Imagine n perceptive and intelligent school children playing together during recess. They are forbidden to play in the mud puddles, and the teacher has told them that if they do, there will be severe consequences. Each child wants to keep clean, but the temptation to play with mud is too great to resist. As a result, k of the children get mud on their foreheads. When the teacher arrives, she says, “I see that some of you have been playing in the mud puddle: the mud on your foreheads is a dead giveaway!” and then continues, “The guilty ones who come forward spontaneously will be given a small penalty; those who do not will receive a punishment they will not easily forget.” She then adds, “I am going to leave the room now, and I will return periodically; if you decide to confess, you must all come forward together when I am in the room. In the meantime, everybody must sit absolutely still and without talking.”

Each child in the room clearly understands that those with mud on their foreheads are “dead meat,” who will be punished no matter what. Obviously, the children do
not want to confess if their foreheads are clean, and clearly, if their foreheads are dirty, they want to come forward so as to avoid the terrible punishment reserved for those who do not confess. As each child shares the same concern, the collective goal is for the children with clean foreheads not to confess and for those with muddy foreheads to come forward simultaneously, and all of this without communication.

Let us examine this goal. The first question is as follows: can a child x find out whether his/her forehead is dirty or not? She/he can see how many, say $f_x$, of the other children are dirty; thus, the question is whether x can determine if $k = f_x$ or $k = f_x + 1$. The second, more complex question is as follows: can all the children with mud on their foreheads find out at the same time, so that they can come forward together? In other words, can the exact value of k become common knowledge?

The children, being perceptive and intelligent, determine that the answer to both questions is positive and find the way to achieve the common goal, and thus common knowledge, without communication (Exercise 1.12.6).

IMPORTANT. When working in a submodel, all the restrictions defining the submodel are common knowledge to all entities (unless otherwise specified).

1.8.2 Types of Knowledge

We can have various types of knowledge, such as knowledge about the communication topology, about the labeling of the communication graph, or about the input data of the communicating entities. In general, if we have some knowledge of the system, we can exploit it to reduce the cost of a protocol, although this may result in making the applicability of the protocol more limited.

A type of knowledge of particular interest is the one regarding the communication topology (i.e., the graph G). In fact, as will be seen later, the complexity of a computation may vary greatly depending on what the entities know about G. Following are some elements that, if they are common knowledge to the entities, may affect the complexity.

1. Metric Information: numeric information about the network; for example, number n = |V| of nodes, number m = |E| of links, diameter, girth, etcetera. This information can be exact or approximate.
2. Topological Properties: knowledge of some properties of the topology; for example, “G is a ring network,” “G does not have cycles,” “G is a Cayley graph,” etcetera.
3. Topological Maps: a map of the neighborhood of the entity up to distance d, a complete “map” of G (e.g., the adjacency matrix of G), a complete “map” of (G, l) (i.e., it contains also the labels), etcetera.

Note that some types of knowledge imply other knowledge; for example, if an entity with k neighbors knows that the network is a complete undirected graph, then it knows that n = k + 1.

As a topological map provides all possible metric and structural information, this type of knowledge is very powerful and important. The strongest form of this type is full topological knowledge: availability at each entity of a labeled graph isomorphic to (G, l), the isomorphism, and its own image; that is, every entity has a complete map of (G, l) with the indication, “You are here.”

Another type of knowledge refers to the labeling l. What is very important is whether the labeling has some global consistency property.

We can distinguish two other types, depending on whether the knowledge is about the (input) data or the status of the entities and of the system, and we shall call them type-D and type-S, respectively.

Examples of type-D knowledge are the following:
Unique identifiers: all input values are distinct;
Multiset: input values are not necessarily identical;
Size: number of distinct values.

Examples of type-S knowledge are the following:
System with leader: there is a unique entity in status “leader”;
Reset: all nodes are in the same status;
Unique initiator: there is a unique entity in status “initiator.”

For example, in the broadcasting problem we discussed in Section 1.5, the Unique initiator knowledge was assumed as a part of the problem definition.

1.9 TECHNICAL CONSIDERATIONS

1.9.1 Messages

The content of a message obviously depends on the application; in any case, it consists of a finite (usually bounded) sequence of bits. The message is typically divided into subsequences, called fields, with a predefined meaning (“type”) within the protocol. Examples of field types are the following: message identifier or header, used to distinguish between different types of messages; originator and destination fields, used to specify the (identity of the) entity originating the message and of the entity for which the message is intended; data fields, used to carry information needed in the computation (the nature of the information obviously depends on the particular application under consideration).

Thus, in general, a message M will be viewed as a tuple

$$M = \langle f_1, f_2, \ldots, f_k \rangle,$$

where k is a (small) predefined constant, and each $f_i$ (1 ≤ i ≤ k) is a field of a specified type, each type of a fixed length. So, for example, in protocol Flooding, there is only one type of message; it is composed of two fields, $M = \langle f_1, f_2 \rangle$, where $f_1$ is a message identifier (containing the information: “this is a broadcast message”), and $f_2$ is a data field containing the actual information I being broadcast.

If (the limit on) the size of a message is a system parameter (i.e., it does not depend on the particular application), we say that the system has bounded messages. Such is, for example, the limit imposed on the message length in packet-switching networks, as well as on the length of control messages in circuit-switching networks (e.g., telephone networks) and in message-switching networks.
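As an illustration of a message viewed as a tuple of typed fields, the following Python sketch models the two-field Flooding message. The class name `FloodingMessage`, the byte encoding of the fields, the size computation, and the helper `bounded_messages_needed` are all assumptions made for the example; the text does not prescribe any encoding.

```python
from typing import NamedTuple


class FloodingMessage(NamedTuple):
    identifier: str  # f1: says "this is a broadcast message"
    data: bytes      # f2: the information I being broadcast


def size_in_bits(message):
    """Size of the message under a hypothetical one-byte-per-character encoding."""
    return 8 * (len(message.identifier.encode("utf-8")) + len(message.data))


def bounded_messages_needed(total_bits, bound_bits):
    """Smallest number of messages of at most bound_bits bits carrying total_bits bits."""
    return -(-total_bits // bound_bits)  # ceiling division on integers


msg = FloodingMessage(identifier="BCAST", data=b"hello")
print(size_in_bits(msg), bounded_messages_needed(size_in_bits(msg), 32))
```

With a hypothetical bound of 32 bits per message, the 80-bit message above would have to be split into three bounded messages.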

Bounded messages are also called packets and contain at most µ(G) bits, where µ(G) is the system-dependent bound called packet size. Notice that sending a sequence of K bits in G will require the transmission of at least $\lceil K/\mu(G) \rceil$ packets.

1.9.2 Protocol Notation

A protocol B(x) is a set of rules. We have already introduced in Section 1.5 most of the notation for describing those rules. Let us now complete the description of the notation we will use for protocols. We will employ the following conventions:

1. Rules will be grouped by status.
2. If the action for a (status, event) pair is nil, then, for simplicity, the corresponding rule will be omitted from the description. As a consequence, if no rule is described for a (status, event) pair, the default will be that the pair enables the Null action.
WARNING. Although convenient (it simplifies the writing), this convention demands extra care in the description: if we forget to write a rule for an event occurring in a given status, it will be assumed that a rule exists and the action is nil.
3. If an action contains a change of status, this operation will be the last one before exiting the action.
4. The set of status values of the protocol and the set of restrictions under which the protocol operates will be explicit.

Using these conventions, the protocol Flooding defined in Section 1.5 will be written as shown in Figure 1.5.

Precedence

The external events are as follows: spontaneous impulse (Spontaneously), reception of a message (Receiving), and alarm clock ring (When). Different types of external events can occur simultaneously; for example, the alarm clock might ring at the same time a message arrives. The simultaneous events will be processed sequentially. To determine the order in which they will be processed, we will use the following precedence between external events:

Spontaneously > When > Receiving;

that is, the spontaneous impulse takes precedence over the alarm clock, which has precedence over the arrival of a message.

At most one spontaneous impulse can occur at an entity at any one time. As there is locally only one alarm clock, at any time there will be at most one When event. By contrast, it is possible that more than one message arrives at the same time at an entity from different neighbors; should this be the case, these simultaneous


DISTRIBUTED COMPUTING ENVIRONMENTS

PROTOCOL Flooding.

Status Values: S = {INITIATOR, IDLE, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.

Restrictions: Bidirectional Links, Total Reliability, Connectivity, and Unique Initiator.

INITIATOR
    Spontaneously
    begin
        send(M) to N(x);
        become DONE;
    end

IDLE
    Receiving(M)
    begin
        Process(M);
        send(M) to N(x) − {sender};
        become DONE;
    end

FIGURE 1.5: Flooding Protocol

Receiving events all have the same precedence and will be processed sequentially in an arbitrary order.

1.9.3 Communication Mechanism

The communication mechanisms of a distributed computing environment must handle transmissions and arrivals of messages. The mechanisms at an entity can be seen as a system of queues. Each link (x, y) ∈ E corresponds to a queue, with access at x and exit at y; the access is called out-port and the exit is called in-port. Each entity has thus two types of ports: out-ports, one for each out-neighbor (or out-link), and in-ports, one for each in-neighbor (or in-link). At an entity, each out-port has a distinct label (recall the Local Orientation axiom (Axiom 1.3.2)) called port number: the out-port corresponding to (x, y) has label l_x(x, y); similarly for the in-ports. The sets N_in and N_out will in practice consist of the port numbers associated with those neighbors; this is because an entity has no other information about its neighbors (unless we add restrictions).

The command “send M to W” will have a copy of the message M sent through each of the out-ports specified by W. When a message M is sent through an out-port l, it is inserted in the corresponding queue. In absence of failures (recall the Finite Communication Delays axiom), the communication mechanism will eventually remove it from the queue and deliver it to the other entity through the corresponding in-port, generating the Receiving(M) event; at that time the variable sender will be set to l.
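The queue view of the communication mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name Network and the methods add_link, send, and deliver_one are hypothetical, and the sketch assumes that the value reported as sender is the label of the in-port at the receiving entity.

```python
from collections import deque


class Network:
    """Toy model of links as queues; all names here are illustrative."""

    def __init__(self):
        self.queues = {}    # (x, out-port label) -> messages in transit
        self.endpoint = {}  # (x, out-port label) -> (y, in-port label at y)

    def add_link(self, x, out_label, y, in_label):
        # The link (x, y): access (out-port) at x, exit (in-port) at y.
        self.queues[(x, out_label)] = deque()
        self.endpoint[(x, out_label)] = (y, in_label)

    def send(self, x, message, ports):
        # "send M to W": one copy of M is queued on each out-port in W.
        for label in ports:
            self.queues[(x, label)].append(message)

    def deliver_one(self, x, out_label):
        # Remove the oldest message from the queue and hand it to the
        # other entity; this models the Receiving(M) event, with the
        # in-port label playing the role of the variable `sender`.
        message = self.queues[(x, out_label)].popleft()
        y, in_label = self.endpoint[(x, out_label)]
        return y, message, in_label


net = Network()
net.add_link("x", 1, "y", 2)
net.add_link("x", 2, "z", 1)
net.send("x", "M", [1, 2])
print(net.deliver_one("x", 1))
```

The entity x refers to its neighbors only through the port labels 1 and 2; the names "y" and "z" are known to the simulator, not to x.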


1.10 SUMMARY OF DEFINITIONS

Distributed Environment: Collection of communicating computational entities.
Communication: Transmission of a message.
Message: Bounded sequence of bits.
Entity’s Capability: Local processing, local storage, access to a local clock, and communication.
Entity’s Status Register: At any time, an entity’s status register has a value from a predefined set of status values.
External Events: Arrival of a message, alarm clock ring, and spontaneous impulse.
Entity’s Behavior: Entities react to external events. The behavior is dictated by a set of rules. Each rule has the form STATUS × EVENT → Action, specifying what the entity has to do if a certain external event occurs when the entity is in a given status. The set of rules must be nonambiguous and complete.
Actions: An action is an indivisible (i.e., uninterruptible) finite sequence of operations (local processing, message transmission, change of status, and setting of alarm clock).
Homogeneous System: A system is homogeneous if all the entities have the same behavior. Every system can be made homogeneous.
Neighbors: The in-neighbors of an entity x are those entities from which x can receive a message directly; the out-neighbors are those to which x can send a message directly.
Communication Topology: The directed graph G = (V, E) defined by the neighborhood relation. If the Bidirectional Links restriction holds, then G is undirected.
Axioms: There are two axioms: local orientation and finite communication delays.
Local Orientation: An entity can distinguish among its out-neighbors and among its in-neighbors.
Finite Communication Delays: In absence of failures, a message eventually arrives.
Restriction: Any additional property.

1.11 BIBLIOGRAPHICAL NOTES

Several attempts have been made to derive formalisms capable of describing both distributed systems and computations performed in such systems.
A significant amount of study has been devoted to defining formalisms that would ease the task of formally proving properties of distributed computations (e.g., absence of deadlock, liveness, etc.). The models proposed for systems of concurrent processes do provide both a formalism for describing a distributed computation and a proof system that


can be employed within the formalism; such is, for example, the Unity model of K. Mani Chandy and Jayadev Misra [1]. Other models, whose intended goal is still to provide a proof system, have been specifically tailored for distributed computations. In particular, the Input–Output Automata model of Nancy Lynch and Mark Tuttle [4] provides a powerful tool that has helped discover and fix “bugs” in well-known existing protocols.

For the investigators involved in the design and analysis of distributed algorithms, the main concern rests with efficiency and complexity; proving correctness of an algorithm is a compulsory task, but it is usually accomplished using traditional mathematical tools (which are generally considered informal techniques) rather than with formal proof systems. The formal models of computation employed in these studies, as well as the one used in this book, mainly focus on those factors that are directly related to the efficiency of a distributed computation and the complexity of a distributed problem: the underlying communication network, the communication primitives, the amount and type of knowledge available to the processors, etcetera.

Modal logic, and in particular the notion of common knowledge, is a useful tool to reason about distributed computing environments in presence of failures. The notion of knowledge used here was developed independently by Joseph Halpern and Yoram Moses [2], Daniel J. Lehmann [3], and Stanley Rosenschein [5].

The model we have described and will employ in this book uses reactive entities (they react to external stimuli). Several formal models (including Input–Output Automata) use instead active entities. To understand this fundamental difference, consider a message in transit toward an entity that is expecting it, with no other activity in the system.
In an active model, the entity will attempt to receive the message, even while it is not there; each attempt is an event; hence, this simple situation can actually cause an unpredictable number of events. By contrast, in a reactive model, the entity does nothing; the only event is the arrival of the message that will “wake up” the entity and trigger its response. Using the analogy of waiting for the delivery of a pizza, in the active model, you (the entity) must repeatedly open the door (i.e., act) to see if the person supposed to deliver the pizza has arrived; in the reactive model, you sit in the living room until the bell rings and then go and open the door (i.e., react). The two models are equally powerful; they just represent different ways of looking at and expressing the world. It is our contention that, at least for the description and the complexity analysis of protocols and distributed algorithms, the reactive model is more expressive and simpler to understand, to handle, and to use.

1.12 EXERCISES, PROBLEMS, AND ANSWERS

1.12.1 Exercises and Problems

Exercise 1.12.1 Prove that the flooding technique introduced in Section 1.5 is correct, that is, it terminates within finite time, and all entities will receive the information held by the initiator.


Exercise 1.12.2 Determine the exact number of message transmissions required by the protocol Flooding described in Section 1.5.

Exercise 1.12.3 In Section 1.5 we have solved the broadcasting problem under the restriction of Bidirectional Links. Solve the problem using the Reciprocal Communication restriction instead.

Exercise 1.12.4 In Section 1.5 we have solved the broadcasting problem under the restriction of Bidirectional Links. Solve the problem without this restriction.

Exercise 1.12.5 Show that any protocol B can be rewritten so that S_START consists of only one status. (Hint: Introduce a new input variable.)

Exercise 1.12.6 Consider the muddy children problem discussed in Section 1.8.1. Show that, within finite time, all the children with a muddy forehead can simultaneously determine that they are not clean. (Hint: Use induction on k.)

Exercise 1.12.7 Half-duplex links allow communication to go in both directions, but not simultaneously. Design a protocol that implements half-duplex communication between two connected entities, a and b. Prove its correctness and analyze its complexity.

Exercise 1.12.8 Half-duplex links allow communication to go in both directions, but not simultaneously. Design a protocol that implements half-duplex communication between three entities, a, b, and c, connected to each other. Prove its correctness and analyze its complexity.

1.12.2 Answers to Exercises

Answer to Exercise 1.12.1 Let us prove that every entity will indeed receive the message. The proof is by induction on the distance d of an entity from the initiator s. The result is clearly true for d = 0. Assume that it is true for all entities at distance at most d. Let x be an entity at distance d + 1 from s. Consider a shortest path s → x_1 → . . . → x_d → x between s and x. As x_d is at distance d from s, by the induction assumption it has the message. If x_d received the message from x, then x already received the message and the proof is completed. Otherwise, x_d is the initiator itself (when d = 0) or received the message from a different neighbor; in either case it then sends the message to all its other neighbors, including x. Hence x will eventually receive the message.

Answer to Exercise 1.12.2 The total number of messages sent without the improvement was Σ_{x ∈ V} |N(x)| = 2|E| = 2m; in Flooding, every entity (except the initiator) will send one message less. Hence the total number of messages is 2m − (|V| − 1) = 2m − n + 1.
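The count 2m − n + 1 can be checked on small graphs with a minimal simulation of Flooding. The sketch below is an illustration, not the book's code: the function name flooding_messages and the adjacency-dictionary representation are assumptions, and messages are delivered in FIFO order purely for convenience (the count does not depend on the delivery order).

```python
from collections import deque


def flooding_messages(adj, initiator):
    """Run Flooding on an undirected connected graph given as
    {node: [neighbors]}; return (messages sent, all entities DONE?)."""
    status = {v: "IDLE" for v in adj}
    in_transit = deque()  # pairs (sender, receiver)
    sent = 0

    # INITIATOR, Spontaneously: send(M) to N(x); become DONE
    for y in adj[initiator]:
        in_transit.append((initiator, y))
        sent += 1
    status[initiator] = "DONE"

    while in_transit:
        sender, x = in_transit.popleft()
        if status[x] == "IDLE":
            # IDLE, Receiving(M): send(M) to N(x) - {sender}; become DONE
            for y in adj[x]:
                if y != sender:
                    in_transit.append((x, y))
                    sent += 1
            status[x] = "DONE"
        # DONE, Receiving(M): no rule is written, so the action is nil

    return sent, all(s == "DONE" for s in status.values())


ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}  # n = 5, m = 5
print(flooding_messages(ring, 0))  # 2m - n + 1 = 6 messages expected
```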


Answer to Exercise 1.12.6 (Basis of Induction only) Consider first the case k = 1: Only one child, say z, has a dirty forehead. In this case, z will see that everyone else has a clean forehead; as the teacher has said that at least one child has a dirty forehead, z knows that he/she must be the one. Thus, when the teacher arrives, he/she comes forward. Notice that a clean child sees that z is dirty but finds out that his/her own forehead is clean only when z goes forward.

Consider now the case k = 2: There are two dirty children, a and b; a sees the dirty forehead of b and the clean one of everybody else. Clearly he/she does not know his/her own status; he/she knows that if he/she is clean, b is the only one who is dirty and will go forward when the teacher arrives. So, when the teacher comes and b does not go forward, a understands that his/her forehead is also dirty. (A similar reasoning is carried out by b.) Thus, when the teacher returns the second time, both a and b go forward.
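The two cases above follow one pattern, which a short Python sketch can make concrete. The general rule used here is an assumption that extrapolates the cases k = 1 and k = 2 along the induction suggested in the hint: a child who sees j dirty foreheads comes forward at the (j + 1)-th visit of the teacher if nobody has come forward before. The function name muddy_children is hypothetical.

```python
def muddy_children(muddy):
    """muddy[i] is True if child i has a dirty forehead (at least one
    must be True). Return (visit number, set of children who come
    forward at that visit), under the rule assumed in the lead-in."""
    if not any(muddy):
        raise ValueError("the teacher states that at least one child is dirty")
    total = sum(muddy)
    # Each child sees every forehead except his/her own.
    seen = [total - (1 if m else 0) for m in muddy]
    visit = 0
    while True:
        visit += 1
        # A child seeing j dirty foreheads comes forward at visit j + 1,
        # provided nobody came forward at an earlier visit.
        forward = {i for i, j in enumerate(seen) if j == visit - 1}
        if forward:
            return visit, forward


print(muddy_children([True, False, False]))        # case k = 1
print(muddy_children([True, True, False, False]))  # case k = 2
```

With k dirty children, each dirty child sees k − 1 dirty foreheads and each clean child sees k, so under this rule exactly the dirty children come forward, all at visit k.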

BIBLIOGRAPHY

[1] K.M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley, 1988.
[2] J.Y. Halpern and Y. Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990.
[3] D.J. Lehmann. Knowledge, common knowledge and related puzzles. In 3rd ACM Symposium on Principles of Distributed Computing (PODC), pages 62–67, Vancouver, 1984.
[4] N.A. Lynch and M.R. Tuttle. Hierarchical correctness proofs of distributed algorithms. In 6th ACM Symposium on Principles of Distributed Computing (PODC), pages 137–151, Vancouver, 1987.
[5] S.J. Rosenschein. Formal theories of knowledge in AI and robotics. New Generation Computing, 3:345–357, 1985.

CHAPTER 2

Basic Problems and Protocols

The aim of this chapter is to introduce some of the basic, primitive, computational problems and solution techniques. These problems are basic in the sense that their solution is commonly (sometimes frequently) required for the functioning of the system (e.g., broadcast and wake-up); they are primitive in the sense that their computation is often a preliminary step or a module of complex computations and protocols (e.g., traversal and spanning-tree construction).

Some of these problems (e.g., broadcast and traversal), by their nature, are started by a single entity; in other words, these computational problems have, in their definition, the restriction unique initiator (UI). Other problems (e.g., wake-up and spanning-tree construction) have no such restriction. The computational differences created by the additional assumption of a single initiator can be dramatic.

In this chapter we have also included the discussion of (multiple-initiators) computations in tree networks. Their fundamental importance derives from the fact that most global problems (i.e., problems that, to be solved, require the involvement of all entities) can often be correctly, easily, and efficiently solved by designing a protocol for trees and executing it on a spanning tree of the network.

All the problems considered here require, for their solution, the Connectivity (CN) restriction (i.e., every entity must be reachable from every other entity). In general, and unless otherwise stated, we will also assume Total Reliability (TR) and Bidirectional Links (BL). These three restrictions are commonly used together, and the set R = {BL, CN, TR} will be called the set of standard restrictions.

The techniques we introduce in this chapter to solve these problems are basic ones; once properly understood, they form a powerful and essential toolset that can be effectively employed by every designer of distributed algorithms.
2.1 BROADCAST

2.1.1 The Problem

Consider a distributed computing system where only one entity, x, knows some important information; this entity would like to share this information with all the other entities in the system; see Figure 2.1. This problem is called broadcasting (Bcast),


BASIC PROBLEMS AND PROTOCOLS

FIGURE 2.1: Broadcasting Process.

and we have already started its examination in the previous chapter. To solve this problem means to design a set of rules that, when executed by the entities, will lead (within finite time) to a configuration where all entities will know the information; the solution must work regardless of which entity has the information at the beginning.

Built into the definition of the problem is the assumption, Unique Initiator (UI), that only one entity will start the task. Actually, this assumption is further restricted, because the unique initiator must be the one with the initial information; we shall denote this restriction by UI+.

To solve this problem, every entity must clearly be involved in the computation. Hence, for its solution, broadcasting requires the Connectivity (CN) restriction (i.e., every entity must be reachable from every other entity); otherwise some entities will never receive the information. We have seen a simple solution to this problem, Flooding, under two additional restrictions: Total Reliability (TR) and Bidirectional Links (BL). Recall that the set R = {BL, CN, TR} is the set of standard restrictions.

2.1.2 Cost of Broadcasting

As we have seen, the solution protocol Flooding uses O(m) messages and, in the worst case, O(d) ideal time units, where d is the diameter of the network. The first and natural question is whether these costs could be reduced significantly (i.e., in order of magnitude) using a different approach or technique, and if so, by how much. This question is equivalent to asking what the complexity of the broadcasting problem is. To answer this type of question we need to establish a lower bound: to find a bound f (typically, a function of the size of the network) and to prove that the cost of every solution algorithm is at least f. In other words, a lower bound holds irrespective of the protocol and depends solely on the problem; hence, it is an indication of how complex the problem really is.
We will denote by M(Bcast/RI+) and T(Bcast/RI+) the message and the time complexity of broadcasting under RI+ = R ∪ UI+, respectively. A lower bound on the amount of ideal time units required to perform a broadcast is simple to derive: Every entity must receive the information regardless of how distant it is from the initiator, and any entity could be the initiator. Hence, in the worst case,

T(Bcast/RI+) ≥ Max{d(x, y) : x, y ∈ V} = d.    (2.1)


The fact that Flooding performs the broadcast in d ideal time units means that the lower bound is tight (i.e., it can be achieved) and that Flooding is time optimal. In other words, we know exactly the ideal time complexity of broadcasting:

Property 2.1.1 The ideal time complexity of broadcasting under RI+ is Θ(d).

Let us now consider the message complexity. An obvious lower bound on the number of messages is also easy to derive: in the end, every entity must know the information; thus a message must be received by each of the n − 1 entities that initially did not have the information. Hence, M(Bcast/RI+) ≥ n − 1. With a little extra effort, we can derive a more accurate lower bound:

Theorem 2.1.1 M(Bcast/RI+) ≥ m.

Proof. Assume that there exists a correct broadcasting protocol A which, in each execution, under RI+ on every G, uses fewer than m(G) messages. This means that there is at least one link in G where no message is transmitted in any direction during an execution of the algorithm. Consider an execution of the algorithm on G, and let e = (x, y) ∈ E be the link where no message is transmitted by A. Now construct a new graph G′ from G by removing the edge e, and adding a new node z and two new edges e1 = (x, z) and e2 = (y, z), which take at x and at y the port labels previously used by e (see Fig. 2.2). Set z in a noninitiator status. Run exactly the same execution of A on the new graph G′: since no message was sent along (x, y), this is possible. But since no message was sent along (x, y) in the original execution, x and y never send a message to z in the current execution. As a result, z will never receive the information (i.e., change status). This contradicts the fact that A is a correct broadcasting protocol. ∎

FIGURE 2.2: A message must be sent on each link.


This means that any broadcasting algorithm requires Ω(m) messages. Since Flooding solves broadcasting with 2m − n + 1 messages (see Exercise 2.9.1), this implies M(Bcast/RI+) ≤ 2m − n + 1. Since the upper bound and the lower bound are of the same order of magnitude, we can summarize:

Property 2.1.2 The message complexity of broadcasting under RI+ is Θ(m).

The immediate consequence is that, in order of magnitude, Flooding is a message-optimal solution. Thus, if we want to design a new protocol to improve the 2m − n + 1 cost of Flooding, the best we can hope to achieve is to reduce the constant 2; in any case, because of Theorem 2.1.1, the reduction cannot bring the constant below 1.
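The quantities in the time bound (2.1) are easy to compute for a concrete network. The following Python sketch is an illustration under the assumption of unit delays: with Flooding, an entity at distance d(x, y) from the initiator x first receives the information at time d(x, y), so the time for initiator x is its eccentricity, and the worst case over all initiators is the diameter d. The function names eccentricity and diameter and the adjacency-dictionary representation are choices of this sketch.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest distance from `source` in a connected undirected graph
    given as {node: [neighbors]} (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """d = Max{d(x, y) : x, y in V}, the right-hand side of (2.1)."""
    return max(eccentricity(adj, x) for x in adj)


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(eccentricity(path, 1), diameter(path))
```

On this path, an initiator in the middle needs fewer time units than one at an end, but the worst case over all initiators is the diameter.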

2.1.3 Broadcasting in Special Networks

The results we have obtained so far apply to generic solutions; that is, solutions that do not depend on G and can thus be applied regardless of the communication topology (provided it is undirected and connected). Next, we will consider performing the broadcast in special networks. Throughout we will assume the standard restrictions plus UI+.

Broadcasting in Trees

Consider the case when G is a tree; that is, G is connected and contains no cycles. In a tree, m = n − 1; hence, the use of protocol Flooding for broadcasting in a tree will cost 2m − (n − 1) = 2(n − 1) − (n − 1) = n − 1 messages.

IMPORTANT. This cost is achieved even if the entities do not know that the network is a tree.

IMPORTANT. An interesting side effect of broadcasting on a tree is that the tree becomes rooted in the initiator of the broadcast.

Broadcasting in Oriented Hypercubes

A communication topology that is commonly used as an interconnection network is the (k-dimensional) labeled hypercube, denoted by H_k. An oriented hypercube H_1 of dimension k = 1 is just a pair of nodes called (in binary) “0” and “1,” connected by a link labeled “1” at both nodes. A hypercube H_k of dimension k > 1 is obtained by taking two hypercubes of dimension k − 1, H′_{k−1} and H″_{k−1}, and connecting the nodes with the same name with a link labeled k at both nodes; the name of each node in H′_{k−1} (respectively, H″_{k−1}) is then modified by prefixing it with the bit 0 (respectively, 1); see Figure 2.3.

FIGURE 2.3: Oriented Hypercube Networks

So, for example, node “0010” in H′_4 will be connected to node “0010” in H″_4 by a link labeled l = 5, and their names will become “00010” and “10010,” respectively. This labeling l of the links is symmetric (i.e., l_x(x, y) = l_y(x, y)) and is called the dimensional labeling of a hypercube.

IMPORTANT. These names are used only for descriptive purposes; they are not known to the entities. By contrast, the labels of the links (i.e., the port numbers) are known to the entities by the Local Orientation axiom.

A hypercube of dimension k has n = 2^k nodes; each node has k links, labeled 1, 2, . . . , k. Hence the total number of links is m = nk/2 = (n/2) log n = O(n log n). A straightforward application of Flooding in a hypercube will cost 2m − (n − 1) = n log n − (n − 1) = O(n log n) messages. However, hypercubes are highly structured networks with many interesting properties. We can exploit these special properties to construct a more efficient broadcast. Obviously, if we do so, the protocol cannot be used in other networks. Consider the following simple strategy.


Strategy HyperFlood:
1. The initiator sends the message to all its neighbors.
2. A node receiving a message from the link labeled l will send the message only to those neighbors with label l′ < l.

NOTE. The only difference between HyperFlood and the normal Flooding is in step 2: Instead of sending the message to all neighbors except the sender, the entity will forward it only to some of them, which will depend on the label of the port from where the message is received.

As we will see, this strategy correctly performs the broadcast using only n − 1 messages (instead of O(n log n)). Let us first examine termination and correctness. Let H_k(x) denote the subgraph of H_k induced by the links where messages are sent by HyperFlood when x is the initiator. Clearly every node in H_k(x) will receive the information.

Lemma 2.1.1 HyperFlood correctly terminates.

Proof. Let x be the initiator; starting from x, the messages are sent only on links with decreasing labels; for example, if y receives the message from link 4, it will forward it only to the ports 1, 2, and 3. To prove that every entity will receive the information sent by x, we need to show that, for every node y, there is a path from x to y such that the sequence of the labels on the path from x to y is decreasing. (Note that the labels on the path do not need to be consecutive integers.) To do so we will use the following property of hypercubes.

Property 2.1.3 In a k-dimensional hypercube H_k, any node x is connected to any other node y by a path π ∈ Π[x, y] such that Λ(π) is a decreasing sequence.

Proof. Consider the k-bit names of x and of y in H_k: ⟨x_k, x_{k−1}, . . . , x_1⟩ and ⟨y_k, y_{k−1}, . . . , y_1⟩. If x ≠ y, these two strings will differ in t ≥ 1 positions. Let j_1, j_2, . . . , j_t be these positions in decreasing order; that is, j_i > j_{i+1}. Consider now the nodes v_0, v_1, v_2, . . . , v_t, where v_0 = x, and the name of v_i differs from the name of v_{i+1} only in the j_{i+1}-th position. Thus, there is a link labeled j_{i+1} connecting v_i to v_{i+1}, and clearly v_t = y. But this means that v_0, v_1, v_2, . . . , v_t is a path from x to y, and the sequence of labels on this path is j_1, j_2, . . . , j_t, which is decreasing. ∎

Thus, H_k(x) is connected and spans (i.e., it contains all the nodes of) H_k, regardless of x. In other words, within finite time, every entity will have the information. ∎

Let us now concentrate on the cost of HyperFlood. First of all observe that

M[HyperFlood/H_k] = n − 1.    (2.2)


To prove that only n − 1 messages will be sent during the broadcast, we just need to show that every entity will receive the information only once. This is true because, for every x, H_k(x) contains no cycles (see Exercise 2.9.9). The proof that, for every x, the eccentricity of x in H_k(x) is k is also left as an exercise (see Exercise 2.9.10); this implies that the ideal time delay of HyperFlood in H_k is always k. That is,

T[HyperFlood/H_k] = k.    (2.3)
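Both costs, (2.2) and (2.3), can be observed on small hypercubes with a minimal simulation of HyperFlood. The sketch below is an illustration under stated assumptions: node names are encoded as integers, the link labeled j connects the two nodes whose names differ in the j-th bit, every message takes one time unit, and the function name hyperflood is hypothetical. The names are used only by the simulator; each simulated entity decides where to forward from the arrival label alone.

```python
def hyperflood(k, initiator):
    """Simulate HyperFlood on H_k with unit delays.
    Return (messages sent, time units, copies received per node)."""
    n = 1 << k
    received = [0] * n

    def neighbor(node, label):
        # The link labeled j joins names differing in bit j (1-based).
        return node ^ (1 << (label - 1))

    # Step 1: the initiator sends on all its links, labeled 1..k.
    # Each entry is (destination, label of the link used).
    in_transit = [(neighbor(initiator, j), j) for j in range(1, k + 1)]
    messages = len(in_transit)
    time = 0
    while in_transit:
        time += 1
        forwarded = []
        for node, label in in_transit:
            received[node] += 1
            # Step 2: forward only on links with a smaller label.
            for j in range(1, label):
                forwarded.append((neighbor(node, j), j))
        messages += len(forwarded)
        in_transit = forwarded
    return messages, time, received


msgs, time_units, copies = hyperflood(3, 0b000)
print(msgs, time_units)  # n - 1 = 7 messages and k = 3 time units
```

In this run every node other than the initiator receives exactly one copy, which is the reason why only n − 1 messages are sent.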

These costs are the best that any broadcast algorithm can achieve in a hypercube, regardless of how much additional knowledge the entities have. However, they are obtained here under the additional restriction that the network is a k-dimensional hypercube with a dimensional labeling; that is, under H = {(G, l) = H_k}. Summarizing, we have

Property 2.1.4 The ideal time complexity of broadcasting in a k-dimensional hypercube with a dimensional labeling under RI+ is Θ(k).

Property 2.1.5 The message complexity of broadcasting in a k-dimensional hypercube with a dimensional labeling under RI+ is Θ(n).

IMPORTANT. The reason why we are able to “bypass” the Ω(m) lower bound expressed by Theorem 2.1.1 is that we are restricting the applicability of the protocol.
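The correctness of HyperFlood rests on the decreasing-label path of Property 2.1.3, and the construction used in its proof is short enough to sketch. The Python below is an illustration only: names are encoded as integers, position j is the j-th bit counted from the right starting at 1, and the function name decreasing_path is hypothetical.

```python
def decreasing_path(x, y, k):
    """Return (path, labels): a path from x to y in H_k whose link
    labels j_1 > j_2 > ... > j_t are the positions where the names of
    x and y differ, taken in decreasing order."""
    differing = x ^ y
    labels = [j for j in range(k, 0, -1) if (differing >> (j - 1)) & 1]
    path = [x]
    for j in labels:
        # Crossing the link labeled j flips the j-th bit of the name.
        path.append(path[-1] ^ (1 << (j - 1)))
    return path, labels


path, labels = decreasing_path(0b000, 0b101, 3)
print([format(v, "03b") for v in path], labels)
```

The names of the first and last node differ in positions 3 and 1, so the path crosses the link labeled 3 and then the link labeled 1.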

Broadcasting in Complete Graphs

Among all network topologies, the complete graph is the one with the most links: Every entity is connected to all others; thus m = n(n − 1)/2 = O(n²) (recall we are considering bidirectional links), and d = 1. The use of a generic protocol will require O(n²) messages. But this is really unnecessary. Broadcasting in a complete graph is easily accomplished: Because everybody is connected to everybody else, the initiator just needs to send the information to its neighbors (i.e., execute the command “send(I) to N(x)”) and the broadcast is completed. This uses only n − 1 messages and d = 1 ideal time. Clearly this protocol, KBcast, works only in a complete graph, that is, under the additional restriction K ≡ “G is a complete graph.” Summarizing:

Property 2.1.6 The message and the ideal time complexity of broadcasting in a complete graph under RI+ are M(Bcast/RI+; K) = n − 1 and T(Bcast/RI+; K) = 1, respectively.


FIGURE 2.4: Wake-Up Process.

2.2 WAKE-UP

2.2.1 Generic Wake-Up

Very often, in a distributed environment, we are faced with the following situation: A task must be performed in which all the entities must be involved; however, only some of them are independently active (because of a spontaneous event, or having finished a previous computation) and ready to compute; the others are inactive, not even aware of the computation that must take place. In these situations, to perform the task, we must ensure that all the entities become active. Clearly, this preliminary step can only be started by the entities that are active already; however, they do not know which other entities (if any) are already active.

This problem is called Wake-up (Wake-Up): An active entity is usually called awake, an inactive (still) one is called asleep; the task is to wake all entities up; see Figure 2.4.

It is not difficult to see the relationship between broadcasting and wake-up: Broadcast is a wake-up with only one initially awake entity; conversely, wake-up is a broadcast with possibly many initiators (i.e., initially more than one entity has the information). In other words, broadcast is just a special case of the wake-up problem.

Interestingly, but not surprisingly, the flooding strategy used for broadcasting actually solves the more general Wake-Up problem. The modified protocol, called WFlood, is described in Figure 2.5. Initially all entities are asleep; any asleep entity can become spontaneously awake and start the protocol. It is not difficult to verify that the protocol correctly terminates under the standard restrictions (Exercise 2.9.7).

Let us concentrate on the cost of protocol WFlood. The number of messages is at least equal to that of broadcast; actually, it is not much more (see Exercise 2.9.6):

2m ≥ M[WFlood] ≥ 2m − n + 1.    (2.4)

As broadcast is a special case of wake-up, not much improvement is possible (except perhaps in the size of the constant):

M(Wake-Up/R) ≥ M(Bcast/RI+) = Ω(m).

The ideal time will, in general, be smaller than the one for broadcast:

T(Bcast/RI+) ≥ T(Wake-Up/R).


PROTOCOL WFlood.

Status Values: S = {ASLEEP, AWAKE}; S_INIT = {ASLEEP}; S_TERM = {AWAKE}.

Restrictions: R.

ASLEEP
    Spontaneously
    begin
        send(W) to N(x);
        become AWAKE;
    end

    Receiving(W)
    begin
        send(W) to N(x) − {sender};
        become AWAKE;
    end

FIGURE 2.5: Wake-Up by Flooding

However, in the case of a single initiator, the two cases coincide. As upper and lower bounds coincide in order of magnitude, we can conclude that protocol WFlood is both message optimal and, in the worst case, time optimal. The complexity of Wake-Up is summarized by the following two properties.

Property 2.2.1 The message complexity of Wake-up under R is Θ(m).

Property 2.2.2 The worst case ideal time complexity of Wake-up under R is Θ(d).

2.2.2 Wake-Up in Special Networks

Trees

The cost of using protocol WFlood for wake-up will depend on the number of initiators. In fact, if there is only one initiator, then this is just a broadcast and costs only n − 1 messages. By contrast, if every entity starts independently, there will be a total of 2(n − 1) messages. Let k denote the number of initiators; note that this number is not a system parameter like n or m; it is, however, bounded by a system parameter: k ≤ n. Then the total number of messages when executing WFlood in a tree will be exactly

M[WFlood/Tree] = n + k − 2.    (2.5)
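Formula (2.5) can be checked with a minimal simulation of WFlood. The reasoning behind it: an entity that starts spontaneously sends |N(x)| messages, while an entity woken up by a message sends |N(x)| − 1, for a total of 2m − (n − k); in a tree, m = n − 1 gives n + k − 2. The Python sketch below is an illustration, not the book's code: the function name wflood_messages is hypothetical, and it assumes that all k initiators start before any message is delivered, so that each of them does start spontaneously.

```python
from collections import deque


def wflood_messages(adj, initiators):
    """Run WFlood on an undirected connected graph {node: [neighbors]}.
    Return (messages sent, all entities AWAKE?)."""
    status = {v: "ASLEEP" for v in adj}
    in_transit = deque()  # pairs (sender, receiver)
    sent = 0

    # ASLEEP, Spontaneously: send(W) to N(x); become AWAKE
    for x in initiators:
        for y in adj[x]:
            in_transit.append((x, y))
            sent += 1
        status[x] = "AWAKE"

    while in_transit:
        sender, x = in_transit.popleft()
        if status[x] == "ASLEEP":
            # ASLEEP, Receiving(W): send(W) to N(x) - {sender}; become AWAKE
            for y in adj[x]:
                if y != sender:
                    in_transit.append((x, y))
                    sent += 1
            status[x] = "AWAKE"
        # AWAKE, Receiving(W): no rule is written, so the action is nil

    return sent, all(s == "AWAKE" for s in status.values())


star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}  # a tree, n = 5
for k_initiators in ([0], [1, 2], [0, 1, 2, 3, 4]):
    print(len(k_initiators), wflood_messages(star, k_initiators))
```

For this tree with n = 5, the three runs use k = 1, 2, and 5 initiators, and the counts agree with n + k − 2.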

Labeled Hypercubes

In Section 2.1, by exploiting the properties of the hypercube and of the dimensional labeling, we have been able to construct a broadcast protocol that uses only O(n) messages, instead of the Ω(n log n) messages required by any generic protocol.


Let us see if we can achieve a similar result also for the wake-up. In other words, can we exploit the properties of a labeled hypercube to do better than generic protocols? The answer is, unfortunately, NO.

Lemma 2.2.1 M(Wake-Up/R; H) = Ω(n log n).

As a consequence, we might as well employ the generic protocol WFlood, which uses O(n log n) messages. Summarizing,

Property 2.2.3 The message complexity of wake-up under R in a k-dimensional hypercube with a dimensional labeling is Θ(n log n).

Complete Graphs

Let us focus on wake-up in a complete graph. The use of the generic protocol WFlood will require O(n²) messages. We can obviously use the simplified broadcast protocol KBcast we developed for complete graphs. The number of messages transmitted will be k(n − 1), where k denotes the number of initiators. However, in the worst case (when every entity is independently awake and they all simultaneously start the protocol), O(n²) messages will still be transmitted.

Let us see if, by exploiting the properties of complete graphs, we can construct a wake-up protocol that uses only O(n) messages, instead of the O(n²) we have achieved so far. (After all, we have been able to do it in the case of the broadcast problem.) Surprisingly, also in this case, the answer is NO.

Lemma 2.2.2 M(Wake-Up/R; K) = Ω(n²).

This implies that the use of WFlood for wake-up is a message-optimal solution. In other words,

Property 2.2.4 The message complexity of wake-up under R in a complete network is Θ(n²).

Complete Graphs with ID

To reduce the number of messages, a more restricted environment is required; that is, we need to make additional assumptions. For example, if we add the restriction that the entities have unique names (restriction Initial Distinct values (ID)), then there are protocols capable of performing wake-up with O(n log n) messages in a complete graph; they are not simple and actually solve a much more complex problem, Election, which we will discuss at length in Chapter 3. Strangely, nothing better than that can be accomplished. In fact, let IR + K = R ∪ K; then the worst case message complexity of wake-up in a complete graph under the standard restrictions R plus ID is as follows:

Property 2.2.5 M(Wake-Up/R; ID; K) ≥ 0.5 n log n.


To see why this is true, we will construct a “bad” but possible case, which any protocol can encounter, and show that, in such a case, $\Omega(n \log n)$ messages will be exchanged. The lower bound will hold even if there is message ordering. For simplicity of discussion and calculation, we will assume that n is a power of 2; the results hold also if this is not the case.

To construct the “bad” case for an (arbitrary) solution protocol A, we will consider a game between the entities on one side and an adversary on the other: the entities obey the rules of the protocol; the adversary will try to make the worst possible scenario occur, so as to force the use of as many messages as possible. The adversary has the following four powers:
1. it decides the initial values of the entities (they must be distinct);
2. it decides which entities spontaneously start the execution of A, and when;
3. it decides when a transmitted message arrives (it must be within finite time); and
4. importantly, it decides the matching between links and labels: let e1, e2, ..., ek be the links incident on x, and let l1, l2, ..., lk be the port labels to be used by x for those links; during the execution, when x performs a “send to l” command, and l has not been assigned yet, the adversary will choose which of the unused links (i.e., links through which no message has been sent or received) the label l will be assigned to.

NOTE. Sending a message to more than one port will be treated as sending the message to each of those ports one at a time (in an arbitrary order).

Whatever the adversary decides, it can happen in a real execution. Let us see how bad a case the adversary can create for A. Two sets of entities will be said to be connected at a time t if at least one message has been transmitted from an entity of one set to an entity of the other.

Adversary’s Strategy.
1. Initially, the adversary will wake up only one entity s, which we will call the seed, and which will start the execution of the protocol. When s decides to send a message to port number l, the adversary will wake up another entity y and assign label l to the edge from s to y. It will then delay the transmission on that link until y also decides to send a message to some port number l′; the adversary will then assign label l′ to the link from y to s and let the two messages arrive at their destinations simultaneously. In this way, each message will reach an awake node, and the two entities are connected. From now on, the adversary will act in a similar way: it will always ensure that messages are sent to already-awake nodes, and that the set of awake nodes is connected.


2. Consider an entity x executing a send operation to an unassigned label a.
   (a) If x has an unused link (i.e., a link on which no messages have been sent so far) connecting it to an awake node, the adversary will assign a to that link. In other words, the adversary will always try to make the awake entities send messages to other awake entities.
   (b) If all links between x and the awake nodes have been used, then the adversary will create another set of awake nodes and connect the two sets.
       i. Let x0, ..., xk−1 be the currently awake nodes, ordered according to their wake-up time (thus, x0 = s is the seed, and x1 = y). The adversary will do the following: choose k inactive nodes z0, ..., zk−1; establish a logical correspondence between xj and zj; assign initial values to the new entities so that the order among them is the same as the one among the values of the corresponding entities; wake up these entities and force them to have the “same” execution (same scheduling and same delays) as the corresponding ones already had. (So, z0 will be woken up first, its first message will be sent to z1, which will be woken up next and will send a message to z0, and so forth.)
       ii. The adversary will then assign label a to the link connecting x to its corresponding entity z in the new set; the message will be held in transit until z (like x did) needs to transmit a message on an unused link (say, with label b) but all the edges connecting it to its set of awake entities have already been used.
       iii. When this happens, the adversary will assign the label b to the link from z to x and make the two messages between x and z arrive and be processed.

Let us summarize the strategy of the adversary: the adversary tries to force the protocol to send messages only to already-awake entities and awakens new entities only when it cannot do otherwise; the newly awake entities are equal in number to the already awake entities; and, before any communication takes place between the two sets, they are forced by the adversary to have the same execution among themselves as the other entities had. When this happens, we will say that the adversary has started a new stage.

Let us now examine the situations created by the adversary with this strategy and analyze the cost of the protocol in the corresponding executions. Let Active(i) denote the awake entities in stage i and New(i) = Active(i) − Active(i − 1) the entities that the adversary woke up in this stage; initially, Active(0) is just the seed. The newly awake entities are equal in number to the already awake entities; that is, |New(i)| = |Active(i − 1)|. Let µ(i − 1) denote the total number of messages that have been exchanged before the activation of the new entities. The adversary forces the new entities to have the same execution as the entities in Active(i − 1) had, thus exchanging µ(i − 1) messages, before allowing the two sets to become connected. Thus, the total number of messages until the communication between the two sets takes place is 2µ(i − 1).


Once the communication takes place, how many messages (including those two) are transmitted before the next stage? The exact answer will depend on the protocol A, but regardless of which protocol we are using, the adversary will not start a new stage i + 1 unless it is forced to; this will happen only if an entity x issues a “send to l” command (where l is an unassigned label) and all the links connecting x to the other awake entities have already been used. This means that x must have either sent to or received from all the entities in Active(i) = Active(i − 1) ∪ New(i). Assume that x ∈ Active(i − 1); then, of all these messages, the ones between x and New(i) can only have occurred in stage i (since those entities were not active before); this means that at least |New(i)| = |Active(i − 1)| additional messages are sent before stage i + 1. If instead x ∈ New(i), these messages have all been transmitted in this stage (as x was not awake before); in other words, even in this case, |New(i)| = |Active(i − 1)| additional messages are sent before stage i + 1.

Summarizing, the total cost µ(i − 1) before stage i is doubled, and at least |Active(i − 1)| additional messages are sent before stage i + 1. In other words, $\mu(i) \geq 2\mu(i-1) + |Active(i-1)|$. As the awake entities double in each stage, and initially only the seed is active, $|Active(i)| = 2^i$. Hence, observing that $\mu(0) = 0$,

$\mu(i) \geq 2\mu(i-1) + 2^{i-1} \geq i\, 2^{i-1}$.

The total number of stages is exactly log n, as the awake processes double every stage. Hence, with this strategy, the adversary can force any protocol to transmit at least µ(log n) messages. As $\mu(\log n) \geq 0.5\, n \log n$, it follows that any wake-up protocol will transmit $\Omega(n \log n)$ messages in the worst case even if the entities have distinct identifiers (ids). More efficient wake-up protocols can be derived if, instead, our system has a “good” labeling of the links.
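The stage-by-stage bound can be checked numerically. The following is a minimal sketch, assuming the recurrence is taken with equality (the least the adversary forces) and that n is a power of 2, as in the text; the function name `forced_messages` is ours and purely illustrative.

```python
def forced_messages(stages: int) -> int:
    """Iterate mu(i) = 2*mu(i-1) + 2**(i-1), starting from mu(0) = 0."""
    mu = 0
    for i in range(1, stages + 1):
        # The previous cost is duplicated by the new set, and at least
        # |Active(i-1)| = 2**(i-1) further messages are forced.
        mu = 2 * mu + 2 ** (i - 1)
    return mu


for n in (2, 8, 64, 1024):
    stages = n.bit_length() - 1          # log2(n) for a power of 2
    mu = forced_messages(stages)
    # Closed form i * 2**(i-1), which equals 0.5 * n * log2(n) at i = log2(n).
    assert mu == stages * 2 ** (stages - 1) == n * stages // 2
    print(n, stages, mu)
```

With equality in the recurrence, the closed form $i\,2^{i-1}$ is met exactly, so the adversary's guarantee after log n stages is precisely 0.5 n log n messages.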

2.3 TRAVERSAL
Traversal of the network allows every entity in the network to be “visited” sequentially (one after the other). Its main uses are in the control and management of a shared resource and in sequential search processes. In abstract terms, the traversal problem starts with an initial configuration where all entities are in the same state (say unvisited) except the one that is visited and is the sole initiator; the goal is to render all the entities visited, but sequentially (i.e., one at a time).

A traversal protocol is a distributed algorithm that, starting from the single initiator, allows a special message called “traversal token” (or simply, token) to reach every entity sequentially (i.e., one at a time). Once a node is reached by the token, it is marked as “visited.” Depending on the traversal strategy employed, we will have different traversal protocols.

2.3.1 Depth-First Traversal
A well-known strategy is the depth-first traversal of a graph. According to this strategy, the graph is visited (i.e., the token is forwarded) trying to go forward as long as possible; if the token is forwarded to an already visited node, it is sent back to the sender, and that link is marked as a back-edge; if the token can no longer be forwarded (it is at a node all of whose neighbors have been visited), the algorithm will “backtrack” until it finds a node with an unvisited neighbor to which the token can be forwarded.

The distributed implementation of depth-first traversal is straightforward.
1. When first visited, an entity remembers who sent the token, creates a list of all its still unvisited neighbors, forwards the token to one of them (removing it from the list), and waits for its reply returning the token.
2. When the neighbor receives the token, it will return the token immediately if it had already been visited by somebody else, notifying that the link is a back-edge; otherwise, it will first forward the token to each of its unvisited neighbors sequentially, and then reply returning the token.
3. Upon the reception of the reply, the entity forwards the token to another unvisited neighbor.
4. Should there be no more unvisited neighbors, the entity can no longer forward the token; it will then send the reply, returning the token to the node from which it first received it.

NOTE. When the neighbor in step (2) determines that a link is a back-edge, it knows that the sender of the token is already visited; thus, it will remove it from the list of unvisited neighbors.

We will use three types of messages: “T” to forward the token in the traversal, “Backedge” to notify the detection of a back-edge, and “Return” to return the token upon local termination.
Protocol DF Traversal is shown in Figure 2.6, where the operation of extracting an element from a set B and assigning it to variable a is denoted by a ⇐ B. Let us examine its costs. Focus on a link (x, y) ∈ E. What messages can be sent on it? Suppose x sends T to y; then y will only send to x either Return (if it was idle when the T arrived) or Backedge (otherwise). In other words, on each link there will be exactly two messages transmitted. Since the traversal is sequential, T[DF Traversal] = M[DF Traversal]; hence

T[DF Traversal] = M[DF Traversal] = 2m.    (2.6)


PROTOCOL DF Traversal.

Status: S = {INITIATOR, IDLE, VISITED, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.
Restrictions: R; UI.

INITIATOR
    Spontaneously
    begin
        Unvisited := N(x);
        initiator := true;
        VISIT;
    end

IDLE
    Receiving(T)
    begin
        entry := sender;
        Unvisited := N(x) − {sender};
        initiator := false;
        VISIT;
    end

VISITED
    Receiving(T)
    begin
        Unvisited := Unvisited − {sender};
        send(Backedge) to {sender};
    end

    Receiving(Return)
    begin
        VISIT;
    end

    Receiving(Backedge)
    begin
        VISIT;
    end

Procedure VISIT
begin
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
        become VISITED
    else
        if not(initiator) then
            send(Return) to entry;
        endif
        become DONE;
    endif
end

FIGURE 2.6: DF Traversal
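The rules of Figure 2.6 can be exercised with a small simulation. This is only a sketch under stated assumptions: the network is a hypothetical adjacency dictionary, the choice `next ⇐ Unvisited` is resolved by taking the smallest label (the protocol allows any choice), and, because the traversal is sequential, a single variable holds the one message in transit. The function name `df_traversal` is ours.

```python
def df_traversal(adj, initiator):
    """Simulate Protocol DF Traversal; return (visit order, messages sent)."""
    status = {x: "IDLE" for x in adj}
    unvisited, entry, order = {}, {}, []

    def visit(x):
        """Procedure VISIT: returns the message x sends, or None if none."""
        if unvisited[x]:
            nxt = min(unvisited[x])           # assumption: deterministic pick
            unvisited[x].remove(nxt)
            status[x] = "VISITED"
            return (x, nxt, "T")
        status[x] = "DONE"
        if x != initiator:
            return (x, entry[x], "Return")
        return None                           # initiator is done: traversal over

    unvisited[initiator] = set(adj[initiator])
    order.append(initiator)
    in_flight = visit(initiator)
    messages = 0
    while in_flight is not None:
        sender, x, kind = in_flight
        messages += 1
        if kind == "T" and status[x] == "IDLE":
            entry[x] = sender                 # first visit
            unvisited[x] = set(adj[x]) - {sender}
            order.append(x)
            in_flight = visit(x)
        elif kind == "T":                     # already visited: back-edge
            unvisited[x].discard(sender)
            in_flight = (x, sender, "Backedge")
        else:                                 # Return or Backedge
            in_flight = visit(x)
    return order, messages


# Hypothetical example: complete graph on 4 nodes (m = 6).
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
order, messages = df_traversal(k4, 0)
print(order, messages)
```

On this example every link carries exactly two messages, so the count agrees with the 2m of Expression 2.6.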

To determine how efficient the protocol is, we are going to determine the complexity of the problem. Using exactly the same technique we employed in the proof of Theorem 2.1.1, we have (Exercise 2.9.11):

Theorem 2.3.1 M(DFT/R) ≥ m.


Therefore, the 2m message cost of protocol DF Traversal is indeed excellent, and the protocol is message optimal.

Property 2.3.1 The message complexity of depth-first traversal under R is $\Theta(m)$.

The time requirements of a depth-first traversal are quite different from those of a broadcast. In fact, since each node must be visited sequentially, starting from the sole initiator, the time complexity is at least the number of nodes other than the initiator:

Theorem 2.3.2 T(DFT/R) ≥ n − 1.

The time complexity of protocol DF Traversal is dreadful. In fact, the upper bound 2m could be several orders of magnitude larger than the lower bound n − 1. For example, in a complete graph, $2m = n^2 - n$. Some significant improvements in the time complexity can, however, be made by going into a finer granularity. We will discuss this topic in greater detail next.

2.3.2 Hacking
Let us examine protocol DF Traversal to see if it can be improved, especially its time cost.

IMPORTANT. When measuring ideal time, we consider only synchronous executions; however, when measuring messages and establishing correctness we must consider every possible schedule of events, especially the nonsynchronous executions.

Basic Hacking
The protocol we have constructed is totally sequential: in a synchronous execution, at each time unit only one message will be sent, and every message requires one unit of time. So, to improve the time complexity, we need to (1) reduce the number of messages and/or (2) introduce some concurrency. By definition of traversal, each entity must receive the token (message T) at least once. In the execution of our protocol, however, some entities receive it more than once; the links from which these other T messages arrive are precisely the back-edges.

Question. Can we avoid sending T messages on back-edges?

To answer this question we must understand why T messages are sent on back-edges. When an entity x sends a T message to y, it does not know whether the link is a back-edge or not; that is, whether y has already been visited by somebody else or not. If x knew which of its neighbors are already visited, it would not send a T message to them, there would be no need for Backedge messages from them, and we would be saving messages and time. Let us examine how to achieve such a condition.


Suppose that, whenever a node is visited (i.e., it receives T) for the first time, it notifies all its (other) neighbors of this event (e.g., sending a “Visited” message) and waits for an acknowledgment (e.g., receiving an “Ack” message) from them before forwarding the token. The consequence of such a simple act is that now an entity ready to forward the token (i.e., to send a T message) really knows which of its neighbors have already been visited. This is exactly what we wanted. The price we have to pay is the transmission of the Visited and Ack messages.

Notice that now an idle entity (that is, an entity that has not yet been involved in the traversal) might receive a Visited message as its first message. In the revised protocol, we will make such an entity enter a new status, available.

Let us examine the effects of this change on the overall time cost of the protocol; call DF+ the resulting protocol. The time is really determined by the number of sequential messages. There are four types of messages that are sent: T, Return, Visited, and Ack. Each entity (except the initiator) will receive only one T message and send only one Return message; the initiator does not receive any T message and does not send any Return; thus, in total there will be 2(n − 1) such messages. Since all these communications occur sequentially (i.e., without any overlap), the time taken by sending the T and Return messages will be 2(n − 1). To determine how many ideal time units are added by the transmission of Visited and Ack messages, consider an entity: its transmission of all the Visited messages takes only a single time unit, since they are sent concurrently; the corresponding Ack messages will also be sent concurrently, adding one more time unit. Since every node will do it, the sending of the Visited messages and receiving of the Ack messages will increase the ideal time of the original algorithm by exactly 2n. This will give us a time cost of

T[DF+] = 4n − 2.    (2.7)

It is also easy to compute how many messages this will cost. As mentioned above, there is a total of 2(n − 1) T and Return messages. In addition, each entity (except the initiator) sends a Visited message to all its neighbors except the one from which it received the token; the initiator will send it to all its neighbors. Thus, denoting by s the initiator, the total number of Visited messages is $|N(s)| + \sum_{x \neq s} (|N(x)| - 1) = 2m - (n - 1)$. Because for each Visited message there will be an Ack, the total message cost will be

M[DF+] = 4m − 2(n − 1) + 2(n − 1) = 4m.    (2.8)

Summarizing, we have been able to reduce the time cost from O(m) to O(n), which, because of Theorem 2.3.2, is optimal. The price has been the doubling of the number of messages.

Property 2.3.2 The ideal time complexity of depth-first traversal under R is $\Theta(n)$.
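The counting argument for DF+ can be replayed on a concrete network. The sketch below does not simulate the protocol; it only applies the accounting used in the text to a hypothetical adjacency dictionary. The name `df_plus_costs` and the example graph are assumptions made for illustration.

```python
def df_plus_costs(adj, initiator):
    """Return (messages, ideal time) of DF+ following the text's accounting."""
    n = len(adj)
    # One T and one Return for every entity except the initiator.
    token_msgs = 2 * (n - 1)
    # The initiator notifies all neighbors; every other entity notifies
    # all neighbors except the one the token came from.
    visited_msgs = len(adj[initiator]) + sum(
        len(adj[x]) - 1 for x in adj if x != initiator
    )
    ack_msgs = visited_msgs                  # one Ack per Visited
    messages = token_msgs + visited_msgs + ack_msgs
    # Sequential T/Return messages, plus two time units (Visited, Ack) per node.
    ideal_time = token_msgs + 2 * n
    return messages, ideal_time


# Hypothetical example: complete graph on 4 nodes (n = 4, m = 6).
k4 = {i: [j for j in range(4) if j != i] for i in range(4)}
m = sum(len(nb) for nb in k4.values()) // 2
messages, ideal_time = df_plus_costs(k4, 0)
assert messages == 4 * m and ideal_time == 4 * len(k4) - 2
print(messages, ideal_time)
```

The assertions restate Expressions 2.8 and 2.7 for this graph; the result does not depend on which entity is the initiator.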


Advanced Hacking
Let us see if the number of messages can be decreased without significantly increasing the time costs.

Question. Can we avoid sending the Ack messages?

To answer this question we must understand what would happen if we did not send Ack messages. Consider an entity x that sends Visited to its neighbors; if we no longer use Ack, x will proceed immediately with forwarding the token. Assume that, after some time, the token arrives, for the first time, at a neighbor z of x (see Fig. 2.7); it is possible that the Visited message sent by x to z has not arrived yet (due to communication delays). In this case, z would not know that x has already been visited and would send the T message to it. That is, we would again send a T message on a back-edge, undoing what we had accomplished with the previous change to the protocol.

FIGURE 2.7: Slow Visited message: z does not know that x has been visited. [Panels (a), (b), and (c) show entities X, Y, and Z exchanging T and Visited messages.]

But the algorithm now is rather different (we are using Visited messages, no longer Backedge messages) and this situation might not happen all the time. Still, if it happens, z will eventually receive the Visited message from x (recall we are operating under total reliability); z can then understand its mistake, pretend nothing happened (just the waste of a T message), and continue as if the T message had never been sent. On its side, x, upon receiving the token, will also understand that z made a mistake and ignore the message; x also realizes (if it did not know already) that z is visited and will remove it from its list of unvisited neighbors.

Although the correctness will not be affected (Exercise 2.9.15), mistakes cost additional messages. Let us examine what the cost of this modified protocol, which we shall call DF++, really is. As before, the “correct” T and Return yield a total of 2n − 2 messages, and the Visited messages are 2m − n + 1 in total. Then there are the “mistakes”; each mistake costs one message. The number of mistakes can be very large. In fact, unfriendly time delays can force mistakes to occur on every back-edge; on some back-edges, there can be two mistakes, one in each direction (Exercise 2.9.16). In other words, there will be at most 2(m − n + 1) incorrect T messages. Summing up, this yields

M[DF++] ≤ 4m − n + 1.    (2.9)

Let us now consider the time. We have an improvement in that the Ack messages are no longer sent, saving n time units. As there are no more Ack messages to wait for, an entity can forward the token at the same time as it transmits the Visited messages; if it does not have any unvisited neighbor to send the T to, the entity will send the Return at the same time as the Visited. Hence, the sending of the Visited messages overlaps with the sending of either a T or a Return message, saving another n time units. In other words, without considering the mistakes, the total time will be 2n − 2.

Let us now also consider the mistakes and evaluate the ideal time of the protocol. Strange as it might sound, when we attempt to measure the ideal execution time of this protocol, no mistakes will ever occur in the execution being measured. This is because mistakes can only occur owing to arbitrarily long communication delays; ideal time, on the contrary, is only measured under unitary delays. But under unitary delays there are no mistakes. Therefore,

T[DF++] = 2n − 2.    (2.10)

IMPORTANT. It is crucial to understand this inherent limit of the cost measure we call ideal time. Unlike the number of messages, ideal time is not a “neutral” measure; it influences (thus limiting) the nature of what we want to measure. In other words, it should be treated and handled with caution. Even greater caution should be employed in interpreting the results it gives.

Extreme Hacking
As we are on a roll, let us observe that we could actually use the T message as an implicit Visited, saving some additional messages. This saving will happen at every entity except those that, when they are reached for the first time by a T message, do not have any unvisited neighbor. Let f denote the number of these nodes; thus the number of Visited messages we save is n − f. Hence, the total number of messages is 4m − n + 1 − n + f. Summarizing, the cost of the optimized protocol, called DF* and described in Figures 2.8 and 2.9, is as follows:

T[DF*] = 2n − 2.    (2.11)

M[DF*] = 4m − 2n + f + 1.    (2.12)
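To see the trade-offs side by side, the cost expressions derived so far can be tabulated for given n, m, and f. This is a minimal sketch that merely evaluates Expressions 2.6 to 2.12 as stated in the text (with the DF++ message count being the upper bound of Expression 2.9); the function name and the sample value of f are hypothetical.

```python
def traversal_costs(n, m, f):
    """Map each protocol to (messages, ideal time) using Expressions 2.6-2.12."""
    return {
        "DF Traversal": (2 * m, 2 * m),                    # (2.6)
        "DF+": (4 * m, 4 * n - 2),                         # (2.8), (2.7)
        "DF++": (4 * m - n + 1, 2 * n - 2),                # (2.9) upper bound, (2.10)
        "DF*": (4 * m - 2 * n + f + 1, 2 * n - 2),         # (2.12), (2.11)
    }


# Hypothetical example: complete graph on 5 nodes, so m = n(n - 1)/2 = 10;
# f = 1 is an assumed, execution-dependent value.
n = 5
m = n * (n - 1) // 2
for name, (messages, time) in traversal_costs(n, m, f=1).items():
    print(f"{name:13s} messages={messages:3d} time={time:3d}")
```

In dense networks the time of the hacked versions drops from the order of m to the order of n, while the message count stays within a small constant factor of m.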


PROTOCOL DF*

Status: S = {INITIATOR, IDLE, AVAILABLE, VISITED, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.
Restrictions: R; UI.

INITIATOR
    Spontaneously
    begin
        initiator := true;
        Unvisited := N(x);
        next ⇐ Unvisited;
        send(T) to next;
        send(Visited) to N(x) − {next};
        become VISITED
    end

IDLE
    Receiving(T)
    begin
        Unvisited := N(x);
        FIRST-VISIT;
    end

    Receiving(Visited)
    begin
        Unvisited := N(x) − {sender};
        become AVAILABLE
    end

AVAILABLE
    Receiving(T)
        FIRST-VISIT;

    Receiving(Visited)
    begin
        Unvisited := Unvisited − {sender};
    end

VISITED
    Receiving(Visited)
    begin
        Unvisited := Unvisited − {sender};
        if next = sender then VISIT; endif
    end

    Receiving(T)
    begin
        Unvisited := Unvisited − {sender};
        if next = sender then VISIT; endif
    end

    Receiving(Return)
    begin
        VISIT;
    end

FIGURE 2.8: Protocol DF*


Procedure FIRST-VISIT
begin
    initiator := false;
    entry := sender;
    Unvisited := Unvisited − {sender};
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
        send(Visited) to N(x) − {entry, next};
        become VISITED;
    else
        send(Return) to {entry};
        send(Visited) to N(x) − {entry};
        become DONE;
    endif
end

Procedure VISIT
begin
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
    else
        if not(initiator) then
            send(Return) to entry;
        endif
        become DONE;
    endif
end

FIGURE 2.9: Routines used by Protocol DF*

IMPORTANT. The value of f, unlike n and m, is not a system parameter. In fact, it is execution-dependent: its value may change at each execution. We shall indicate this fact (for f as well as for any other execution-dependent value) by the use of the subscript .

2.3.3 Traversal in Special Networks

Trees
In a tree network, depth-first traversal is particularly efficient in terms of messages, and there is no need for any optimization effort (hacking). In fact, in any execution of DF Traversal in a tree, no Backedge messages will be sent (Exercise 2.9.12). Hence, the total number of messages will be exactly 2(n − 1). The time complexity is the same as that of the optimized version of the protocol: 2(n − 1).

M[DF Traversal/Tree] = T[DF Traversal/Tree] = 2n − 2    (2.13)

An interesting side effect of a depth-first traversal of a tree is that it constructs a virtual ring on the tree (Figure 2.10). In this ring some nodes appear more than once; in fact the ring has size 2n − 2 (Exercise 2.9.13). This fact will have useful consequences.
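The virtual ring can be made explicit by listing, in order, the entity holding the token at each step of a depth-first traversal of a tree. The sketch below is an illustration only: the tree is a hypothetical adjacency dictionary (not the tree of Figure 2.10), children are explored in the order in which they are listed, and the name `virtual_ring` is ours.

```python
def virtual_ring(tree, root):
    """Return the cyclic sequence of token holders in a DF traversal of a tree."""
    tour = []

    def walk(x, parent):
        tour.append(x)                 # token arrives at x
        for y in tree[x]:
            if y != parent:
                walk(y, x)             # token goes down to y ...
                tour.append(x)         # ... and is returned to x

    walk(root, None)
    # The tour starts and ends at the root; dropping the final repetition
    # gives the ring, which closes back on its first element.
    return tour[:-1]


# Hypothetical tree with n = 6 nodes.
tree = {
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a", "f"],
    "d": ["b"],
    "e": ["b"],
    "f": ["c"],
}
ring = virtual_ring(tree, "a")
assert len(ring) == 2 * len(tree) - 2
print(ring)
```

Each tree link is crossed once in each direction, which is why internal nodes appear several times and the ring has 2n − 2 positions.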


FIGURE 2.10: Virtual ring created by DF Traversal. [The figure shows a tree with nodes a through h and distinguishes real nodes from virtual nodes.]

Rings
In a ring network, every node has exactly two neighbors. Depth-first traversal in a ring can be achieved in a simple way: the initiator chooses one direction and the token is just forwarded along that direction; once the token reaches the initiator, the traversal is completed. In other words, each entity will send and receive a single T message. Hence both the time and the message costs are exactly n. Clearly this protocol can be used only in rings.

Complete Graph
In a complete graph, execution of DF* will require $O(n^2)$ messages. Exploiting the knowledge of being in a complete network, a better protocol can be derived: the initiator will sequentially send the token to all its neighbors (which are the other entities in the network); each of these entities will return the token to the initiator without forwarding it to anybody else. The total number of messages is 2(n − 1), and so is the time.

2.3.4 Considerations on Traversal

Traversal as Access Permission
The main use of a traversal protocol is in the control and management of shared resources. For example, access to a shared transmission medium (e.g., bus) must be controlled to avoid collisions (simultaneous frame transmission by two or more entities). A typical mechanism to achieve this is the use of a control (or permission) token. This token is passed from one entity to another according to the same set of rules. An entity can only transmit a frame when it is in possession of the token; once the frame has been transmitted, the token is passed to another entity. A traversal protocol by definition “passes” the token sequentially through all the entities and thus solves the access control problem. The only proviso is that, for the access permission problem, it must be made continuous: once a traversal is terminated, another must be started by the initiator.


The access permission problem is part of a family of problems commonly called Mutual Exclusion, which will be discussed in detail later in the book.

Traversal as Broadcast
It is not difficult to see that any traversal protocol solves the broadcast problem: the initiator puts the information in the token message; every entity will be visited by the token and thus will receive the information. The converse is not necessarily true; for example, Flooding violates the sequentiality requirement since the message is sent to all (other) neighbors simultaneously. The use of traversal to broadcast does not lead to a more efficient broadcasting protocol. In fact, a comparison of the costs of Flooding and DF* (Expressions 1.1 and 2.12) shows that Flooding is more efficient in terms of both messages and ideal time. This is not surprising since a traversal is constrained to be sequential; flooding, by contrast, exploits concurrency to the utmost.

2.4 PRACTICAL IMPLICATIONS: USE A SUBNET
We have considered three basic problems (broadcast, wake-up, and depth-first traversal), studied their complexity, devised solution protocols, and analyzed their efficiency. Let us see what the theoretical results we have obtained tell us about the situation from a practical point of view.

We have seen that generic protocols for broadcasting and wake-up require $\Omega(m)$ messages (Theorem 2.1.1). In some special networks, we can indeed develop topology-dependent solutions and obtain some improvements. A similar situation exists for generic traversal protocols: they all require $\Omega(m)$ messages (Theorem 2.3.1); this cost cannot be reduced (in order of magnitude) unless we make additional restrictions, for example, exploiting some special properties of G of which we have a priori (i.e., at design time) knowledge.

In any connected, undirected graph G, we have $(n^2 - n)/2 \geq m \geq n - 1$, and, for every value in that range, there are networks with that many links; in particular, $m = (n^2 - n)/2$ occurs when G is the complete graph, and $m = n - 1$ when G is a tree.

Summarizing, the cost of broadcasting, wake-up, and traversal depends on the number of links: the more links, the greater the cost; and it can be as bad as $O(n^2)$ messages per execution of any of the solution protocols.

This result is punitive for networks where a large investment has been made in the construction of communication links. As broadcast is a basic communication tool (in some systems, it is a primitive one), dense networks are penalized continuously. Similarly, larger operating costs will be incurred by dense networks every time a wake-up (a very common operation, used as a preliminary step in most computations) or a traversal (fortunately, not such a common operation) is performed.


The theoretical results, in other words, indicate that investments in communication hardware will result in higher operating communication costs. Obviously, this is not an acceptable situation, and it is necessary to employ some “lateral thinking.”

The strategy to circumvent the obstacle posed by these lower bounds (Theorems 2.1.1 and 2.3.1) without restricting the applicability of the protocol is fortunately simple:
1. construct a subnet G′ of G and
2. perform the operations only on the subnet.

If the subnet G′ we construct is connected and spans G (i.e., contains all nodes of G), then doing broadcast on G′ will solve the broadcasting problem on G: every node (entity) will receive the information. Similarly, performing a traversal on G′ will solve that problem on G. The important consequence is that, if G′ is a proper subnet, it has fewer links than G; thus, the cost of performing those operations on G′ will be lower than doing it in G.

Which connected spanning subnet of G should we construct? If we want to minimize the message costs, we should choose the one with the fewest links; thus, the answer is: a spanning tree of G. So, the strategy for a general graph G will be

Strategy Use-a-Tree:
1. construct a spanning tree of G and
2. perform the operations only on this spanning tree.

This strategy has two costs. First, there is the cost of constructing the spanning tree; this task will have to be carried out only once (if no failures occur). Then there are the operating costs, that is, the costs of performing broadcast, wake-up, and traversal on the tree. Broadcast will cost exactly n − 1 messages, and the cost of wake-up and traversal will be twice that amount. These costs are independent of m and thus do not inhibit investments in communication links (which might be useful for other reasons).

2.5 CONSTRUCTING A SPANNING TREE
Spanning-tree construction (SPT) is a classical problem in computer science. In a distributed computing environment, the solution of this problem has, as we have seen, strong practical motivations. It also has a distinct formulation and distinct requirements.

In a distributed computing environment, to construct a spanning tree of G means to move the system from an initial system configuration, where each entity is just aware of its own neighbors, to a system configuration where
1. each entity x has selected a subset Tree-neighbors(x) ⊆ N(x) and
2. the collection of all the corresponding links forms a spanning tree of G.


What is wanted is a distributed algorithm (specifying what each node has to do when receiving a message in a given status) such that, once executed, it guarantees that a spanning tree T(G) of G has been constructed; in the following we will indicate T(G) simply by T, if no ambiguity arises. Note that T is not known a priori to the entities and might not be known after it has been constructed: an entity needs to know only which of its neighbors are also its neighbors in the spanning tree T. As before, we will restrict ourselves to connected networks with bidirectional links and further assume that no failure will occur. We will ﬁrst assume that the construction will be started by only one entity (i.e., Unique Initiator (UI) restriction); that is, we will consider spanning-tree construction under restrictions RI. We will then consider the general problem when any number of entities can independently start the construction. As we will see, the situation changes dramatically from the single-initiator scenario.

2.5.1 SPT Construction with a Single Initiator: Shout

Consider the entities; they do not know G, not even its size. The only things an entity is aware of are the labels on the ports leading to its neighbors (because of the Local Orientation axiom) and the fact that, if it sends a message to a neighbor, the message will eventually be received (because of the Finite Communication Delays axiom and the Total Reliability restriction). How, using just this information, can a spanning tree be constructed? The answer is surprisingly simple. Each entity needs to know which of its neighbors are also neighbors in the spanning tree. The solution strategy is just “ask”:

Strategy Ask-Your-Neighbors:
1. The initiator s will “ask” its neighbors; that is, it will send a message Q = (“Are you my neighbor in the spanning tree?”) to all its neighbors.
2. An entity x ≠ s will reply “Yes” only the first time it is asked and, on this occasion, it will ask all its other neighbors; otherwise, it will reply “No.” The initiator s will always reply “No.”
3. Each entity terminates when it has received a reply from all neighbors to which it asked the question.

For an entity x, its neighbors in the spanning tree T are the neighbors that have replied “Yes” and, if x ≠ s, also the neighbor from which the question was first asked. The corresponding set of rules is depicted in Figure 2.11, where the tree links are shown in bold and the nontree links in dotted lines. The protocol Shout implementing this strategy is shown in Figure 2.12. Initially, all nodes are in status idle except the sole initiator.


FIGURE 2.11: Set of Rules of Shout. (Messages shown: Q, YES, NO. Legend: tree link; not-in-tree link.)

Before we discuss the correctness and the efficiency of the protocol, consider how it is structured and operates. First of all observe that, in Shout, the question Q is broadcast through the network (using flooding). Further observe that, when an entity receives Q, it always sends a reply (either Yes or No). Summarizing, the structure of this protocol is a flood where every information message is acknowledged. This type of structure will be called Flood + Reply.


PROTOCOL Shout

Status: S = {INITIATOR, IDLE, ACTIVE, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.
Restrictions: R; UI.

INITIATOR
   Spontaneously
   begin
      root := true;
      Tree-neighbors := ∅;
      send(Q) to N(x);
      counter := 0;
      become ACTIVE;
   end

IDLE
   Receiving(Q)
   begin
      root := false;
      parent := sender;
      Tree-neighbors := {sender};
      send(Yes) to {sender};
      counter := 1;
      if counter = |N(x)| then
         become DONE
      else
         send(Q) to N(x) − {sender};
         become ACTIVE;
      endif
   end

ACTIVE
   Receiving(Q)
   begin
      send(No) to {sender};
   end

   Receiving(Yes)
   begin
      Tree-neighbors := Tree-neighbors ∪ {sender};
      counter := counter + 1;
      if counter = |N(x)| then become DONE; endif
   end

   Receiving(No)
   begin
      counter := counter + 1;
      if counter = |N(x)| then become DONE; endif
   end

FIGURE 2.12: Protocol Shout
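As an illustration only, the rules of Figure 2.12 can be exercised with a small event-driven simulation. The sketch below is not part of the protocol: the function name shout, the dictionary encoding of the graph, and the random scheduler are assumptions made for the example. It also assumes that messages traveling on the same link in the same direction are delivered in the order in which they were sent; delays are otherwise arbitrary.

```python
import random
from collections import deque

def shout(graph, s, seed=0):
    """Simulate Shout on graph (node -> set of neighbours) with initiator s.

    Returns (tree_neighbors, status, messages_sent).
    """
    rng = random.Random(seed)
    status = {x: "IDLE" for x in graph}
    tree = {x: set() for x in graph}          # Tree-neighbors(x)
    counter = {x: 0 for x in graph}
    links = {}                                # (sender, receiver) -> FIFO queue
    sent = 0

    def send(kind, x, targets):
        nonlocal sent
        for y in targets:
            links.setdefault((x, y), deque()).append(kind)
            sent += 1

    def count_reply(x):
        counter[x] += 1
        if counter[x] == len(graph[x]):
            status[x] = "DONE"

    # INITIATOR, Spontaneously
    send("Q", s, graph[s])
    status[s] = "ACTIVE"

    while True:
        pending = [link for link, queue in links.items() if queue]
        if not pending:
            break
        y, x = rng.choice(pending)            # y sent the message, x receives it
        kind = links[(y, x)].popleft()
        if status[x] == "IDLE" and kind == "Q":
            tree[x] = {y}                     # y becomes the parent of x
            send("Yes", x, [y])
            counter[x] = 1
            if counter[x] == len(graph[x]):
                status[x] = "DONE"
            else:
                send("Q", x, graph[x] - {y})
                status[x] = "ACTIVE"
        elif status[x] == "ACTIVE" and kind == "Q":
            send("No", x, [y])
        elif status[x] == "ACTIVE" and kind == "Yes":
            tree[x].add(y)
            count_reply(x)
        elif status[x] == "ACTIVE" and kind == "No":
            count_reply(x)
        else:
            raise RuntimeError(f"no rule for {kind} in status {status[x]}")
    return tree, status, sent


# Hypothetical example network with n = 5 entities and m = 6 links.
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
tree, status, sent = shout(graph, s=0)
print(sent, {x: sorted(t) for x, t in tree.items()})
```

Different seeds model different communication delays and may produce different trees; in every run the Tree-neighbors sets should describe n − 1 links and every entity should end in status DONE.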

Correctness Let us now show that Flood + Reply, as used above, always constructs a spanning tree; that is, the graph deﬁned by all the Tree-neighbors computed by the entities forms a spanning tree of G; furthermore, this tree is rooted in the initiator s.


Theorem 2.5.1 Protocol Shout correctly terminates.

Proof. This protocol consists of the flooding of Q, where every Q message is acknowledged. Because of the correctness of flooding, we are guaranteed that every entity will receive Q and by construction will reply (either Yes or No) to each Q it receives. Termination then follows. To prove correctness we must show that the subnet G′ defined by all the Tree-neighbors is a spanning tree of G. First observe that, if x is in Tree-neighbors of y, then y is in Tree-neighbors of x (see Exercise 2.9.18). If an entity x sends a Yes to y, then it is in Tree-neighbors of y; furthermore, it is connected to s by a path where a Yes is sent on each link (see Exercise 2.9.19). Since every x ≠ s sends exactly one Yes, the subnet G′ defined by all the Tree-neighbors contains all the entities (i.e., it spans G), it is connected, and contains no cycles (see Exercise 2.9.20). Therefore, it is a spanning tree of G. ∎

Note that G′ is actually a tree rooted in the initiator. Recall that, in a rooted tree, every node (except the root) has one parent: the neighbor closest to the root; all its other neighbors are called children. The neighbor to which x sends a Yes is its parent; all neighbors from which it receives a Yes are its children. This fact can be useful in subsequent operations.

IMPORTANT. The execution of protocol Shout ends with local termination: each entity knows when its own execution is over; this occurs when it enters status done. Notice however that no entity, including the initiator, is aware of global termination (i.e., of the fact that every entity has locally terminated). This situation is fairly common in distributed computations. Should we need the initiator to know that the execution has terminated (e.g., to start another task), Flood+Reply can be easily modified to achieve this goal (Exercise 2.9.24).

Costs

The message costs of Flood+Reply, and thus of Shout, are simple to analyze. As mentioned before, Flood+Reply consists of an execution of Flooding(Q) with the addition of a reply (either Yes or No) for every Q. In other words,

M[Flood+Reply] = 2 M[Flooding].

The time costs of Flood+Reply, and thus of Shout, are also simple to determine; in fact (Exercise 2.9.21):

T[Flood+Reply] = T[Flooding] + 1.

Thus, since M[Flooding] = 2m − n + 1 and T[Flooding] = r(s) ≤ d,

M[Shout] = 4m − 2n + 2    (2.14)

T[Shout] = r(s) + 1 ≤ d + 1    (2.15)


The efficiency of protocol Shout can be evaluated better by taking into account the complexity of the problem it is solving. Since every node must be involved, using an argument similar to the proof of Theorem 2.1.1, we have:

Theorem 2.5.2 M(SPT/RI) ≥ m.

Proof. Assume that there exists a correct SPT protocol A that, in each execution under RI on every G, uses fewer than m(G) messages. This means that there is at least one link in G where no message is transmitted in any direction during an execution of the algorithm. Consider an execution of the algorithm on G, and let e = (x, y) ∈ E be the link where no message is transmitted by A. Now construct a new graph G′ from G by removing the edge e and adding a new node z and two new edges e1 = (x, z) and e2 = (y, z) (see Fig. 2.2). Set z in a noninitiator status. Run exactly the same execution of A on the new graph G′: since no message was sent along (x, y), this is possible. But since no message was sent along (x, y) in the original execution in G, x and y never send a message to z in the current execution in G′; and since z is not the initiator and does not receive any message, it will not send any message. Within finite time, protocol A terminates claiming that a spanning tree T of G′ has been constructed; however, z is not part of T, and hence T does not span G′. ∎

And similarly to the broadcast problem we have

Theorem 2.5.3 T(SPT/RI) ≥ d.

This implies that protocol Shout is both time optimal and message optimal with respect to order of magnitude. In other words,

Property 2.5.1 The message complexity of spanning-tree construction under RI is Θ(m).

Property 2.5.2 The ideal time complexity of spanning-tree construction under RI is Θ(d).

In the case of the number of messages, some improvement might be possible in terms of the constant.

Hacking

Let us examine protocol Shout to see if it can be improved, thereby helping us to save some messages.

Question. Do we have to send No messages?
When constructing the spanning tree, an entity needs to know who its tree-neighbors are; by construction, they are the ones that reply Yes and, except for the initiator, also the one that first asked the question. Thus, for this determination, the No messages are not needed. On the other hand, the No messages are used by the protocol to terminate in finite time. Consider an entity x that just sent Q to neighbor y; it is now waiting for a reply. If the reply is Yes, it knows y is in the tree; if the reply is No, it knows y is not. If we remove the sending of No, how can x determine that y would have sent No? More clearly: Suppose x has been waiting for a reply from y for a (very) long time; it does not know whether y has sent Yes and the delays are very long, or y would have sent No and thus will send nothing. Because the algorithm must terminate, x cannot wait forever and has to make a decision. How can x decide? The question is relevant because communication delays are finite but unpredictable.

Fortunately, there is a simple answer to the question that can be derived by examining how protocol Shout operates. Focus on a node x that just sent Q to its neighbor y. Why would y reply No? It would do so only if it had already said Yes to somebody else; if that happened, y sent Q at the same time to all its other neighbors, including x. Summarizing, if y replies No to x, it must have already sent Q to x. We can clearly use this fact to our advantage: after x has sent Q to y, if it receives Yes it knows that y is its neighbor in the tree; if it receives Q, it can deduce that y will definitely reply No to x’s question. All of this can be deduced by x without having received the No. In other words: a message Q that arrives at a node waiting for a reply can act as an implicit negative acknowledgment; therefore, we can avoid sending No messages.

Let us now analyze the message complexity of the resulting protocol Shout+. The time complexity is clearly unchanged; hence

T[Shout+] = r(s) + 1 ≤ d + 1.    (2.16)

On each link (x, y) ∈ E there will be exactly a pair of messages: either Q in one direction and Yes in the other, or two Q messages, one in each direction. Thus

M[Shout+] = 2m.    (2.17)
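The optimization can be checked experimentally with a sketch along the following lines. It is only an illustration: the function name shout_plus, the graph encoding, and the random scheduler are assumptions of the example, not part of the protocol. Since at most one message travels in each direction of a link, no assumption on message ordering is needed here.

```python
import random

def shout_plus(graph, s, seed=0):
    """Simulate Shout+ on graph (node -> set of neighbours) with initiator s.

    Returns (tree_neighbors, status, messages_sent).
    """
    rng = random.Random(seed)
    status = {x: "IDLE" for x in graph}
    tree = {x: set() for x in graph}          # Tree-neighbors(x)
    counter = {x: 0 for x in graph}
    in_transit = []                           # (sender, receiver, kind)
    sent = 0

    def send(kind, x, targets):
        nonlocal sent
        for y in targets:
            in_transit.append((x, y, kind))
            sent += 1

    # INITIATOR, Spontaneously
    send("Q", s, graph[s])
    status[s] = "ACTIVE"

    while in_transit:
        # Deliver a randomly chosen message: delays are finite but arbitrary.
        y, x, kind = in_transit.pop(rng.randrange(len(in_transit)))
        if status[x] == "IDLE":               # first Q received: y is the parent
            tree[x] = {y}
            send("Yes", x, [y])
            counter[x] = 1
            if counter[x] < len(graph[x]):
                send("Q", x, graph[x] - {y})
            status[x] = "ACTIVE"
        else:                                 # ACTIVE: Yes, or Q as implicit No
            if kind == "Yes":
                tree[x].add(y)
            counter[x] += 1
        if counter[x] == len(graph[x]):
            status[x] = "DONE"
    return tree, status, sent


# Hypothetical example network with n = 5 entities and m = 6 links.
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2, 4}, 4: {3}}
tree, status, sent = shout_plus(graph, s=0)
print(sent, {x: sorted(t) for x, t in tree.items()})
```

Whatever the delivery order, the message count should equal 2m, in agreement with Equation 2.17.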

2.5.2 Other SPT Constructions with Single Initiator

SPT Construction by Traversal

It is well known that a depth-first traversal of a graph G actually constructs a spanning tree (df-tree) of that graph. The df-tree is obtained by removing the back-edges from G (i.e., the edges where a Back-edge message was sent in DF Traversal). In other words, the tree-neighbors of an entity x will be those from which it receives a Return message and, if x is not the initiator, the one from which x received the first T. Simple modifications to protocol DF* will ensure that each entity will correctly compute its neighbors in the df-tree and locally terminate in finite time (Exercise 2.9.25). Notice that these modifications involve just local bookkeeping and no additional communication. Hence the time and message costs are unchanged. The resulting protocol is denoted by df-SPT; then

M[df-SPT] = 4m − 2n + f + 1    (2.18)

T[df-SPT] = 2n − 2    (2.19)

We can now better characterize the variable f, which appears in the cost above. In fact, f is exactly the number of leaves of the df-tree constructed by df-SPT (Exercise 2.9.26). Expressions 2.18 and 2.19, when compared with the costs of protocol Shout, indicate that depth-first traversal is not an efficient tool for constructing a spanning tree; this is particularly true for its very high time costs. Notice that, as in protocol Shout, all entities will become aware of their local termination, but only the initiator will be aware of global termination, that is, that the construction of the spanning tree has been completed (Exercise 2.9.27).

SPT Construction by Broadcasting

We have just seen how, with simple modifications, the techniques of flooding and of df-traversal can be used to construct a spanning tree, if there is a unique initiator. This fact is part of a very interesting and more general phenomenon: under RI, the execution of any broadcast protocol constructs a spanning tree. Let us examine this statement in more detail. Take any broadcast protocol B; by definition of broadcast, its execution will result in all entities receiving the information initially held by the initiator. For each entity x different from the initiator, call parent the neighbor from which x received the information for the first time; clearly, everybody except the initiator will have only one parent, and the initiator has none. Consider the relation “x is the parent of y”; then we have the following property whose proof is left as an exercise (Exercise 2.9.28):

Theorem 2.5.4 The parent relationship defines a spanning tree rooted in the initiator.

As a consequence, it would appear that, to solve SPT, we just need to execute a broadcast algorithm without any real modification, just adding some local variables (Tree-neighbors) and doing some local bookkeeping. This is generally not the case; in fact, knowing its parent in the tree is not enough for an entity.
To solve SPT, when an entity x terminates its execution, it must explicitly know which neighbors are its children as well as which neighbors are not its tree-neighbors. If not provided already by the protocol, this information can obviously be acquired. For example, if every entity sends a notification message to its parent, the parents will know their children. To find out which neighbors are not children is more difficult and will depend on the original broadcast protocol. In protocol Shout this is achieved by adding the “Yes” (I am your child) and “No” (I am not your child) messages to Flooding. In the DF Traversal protocol this is already achieved by the “Return” (I am your child) and the “Back-edge” (I am not your child) messages; so, no additional communication is required.

This fact establishes a computational relationship between the broadcasting problem and the spanning-tree construction problem. If I know how to broadcast, then (with minor modifications) I know how to construct a spanning tree with a unique initiator. The converse is also trivially true: Every protocol that constructs a spanning tree solves the broadcasting problem. We shall say that these two problems are computationally equivalent and denote this fact by

Bcast ≡ SPT(UI).    (2.20)

Since, as we have discussed in Section 2.3.4, every traversal protocol performs a broadcast, it follows that, under RI, the execution of any traversal protocol constructs a spanning tree.

SPT Construction by Global Protocols

Actually, we can make a much stronger statement. Call a problem global if every entity must participate in its solution; participation implies the execution of a communication activity: transmission of a message and/or arrival of a message (even if it triggers only the Null action, i.e., no action is taken). Both broadcast and traversal are global problems. Now, every single-initiator protocol that solves a global problem P also solves Bcast; thus, from Equation 2.20, it follows that, under RI, the execution of any solution to a global problem P constructs a spanning tree.

2.5.3 Considerations on the Constructed Tree

We have seen how, with a few more messages than those required by flooding and the same messages as a df-traversal, we can actually construct a spanning tree. As discussed previously, once such a tree is constructed, we can from now on perform broadcast and traversal using only O(n) messages (which is optimal) instead of O(m) (which could be as bad as O(n²)).

IMPORTANT. Different techniques construct different spanning trees. It is even possible that the same protocol constructs different spanning trees when executed at different times. This is for example the case of Shout: Because communication delays are unpredictable, subsequent executions of this algorithm on the same graph may result in different spanning trees. In fact (Exercise 2.9.23) every possible spanning tree of G could be constructed by Shout.


Prior to its execution, it is impossible to predict which spanning tree will be constructed; the only guarantee is that Shout will construct one. This has implications for the time costs of the strategy Use-a-Tree of broadcasting on the spanning tree T instead of the entire graph G. In fact, the broadcast time will be d(T) instead of d(G); but d(T) could be much greater than d(G). For example, if G is the complete graph, the df-tree constructed by any depth-first traversal will have d(T) = n − 1; but d(G) = 1. In general, the trees constructed by depth-first traversal usually have terrible diameters. The ones generated by Shout usually perform better, but there is no guarantee on the diameter of the resulting tree.

This fact poses the problem of constructing spanning trees that have a good diameter; that is, to find a spanning tree T of G such that d(T) is not much more than d(G). For obvious reasons, such a tree is traditionally called a broadcast tree. To construct a broadcast tree we must first understand the relationship between radius and diameter. The eccentricity (or radius) of a node x in G is the longest of its distances to the other nodes:

r_G(x) = Max{d_G(x, y) : y ∈ V}.

A node c with minimum radius (or eccentricity) is called a center; that is, ∀x ∈ V, r_G(c) ≤ r_G(x). There might be more than one center; they all, however, have the same eccentricity, which is denoted by r(G) and called the radius of G:

r(G) = Min{r_G(x) : x ∈ V}.

There is a strong relationship between the radius and the diameter of a graph; in fact, in every graph G,

r(G) ≤ d(G) ≤ 2r(G).    (2.21)

The other ingredient we need is a breadth-first spanning tree (bf-tree). A breadth-first spanning tree of G rooted in a node u, denoted by BFT(u, G), has the following property: The distance between a node v and the root in the tree is the same as their distance in the original graph G. The strategy to construct a broadcast tree with diameter d(T) ≤ 2d(G) is then simple to state:

Strategy Broadcast-Tree Construction:
1. determine a center c of G;
2. construct a breadth-first spanning tree BFT(c, G) rooted in c.

This strategy will construct the desired broadcast tree (Exercise 2.9.29):

Theorem 2.5.5 BFT(c, G) is a broadcast tree of G.
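To make the two steps of the strategy concrete, here is a centralized sketch. It assumes complete knowledge of G in one place, which the entities of a distributed system do not have, so it illustrates only what is to be computed and not how to compute it distributively. The helper names bfs, eccentricity, diameter, and broadcast_tree are hypothetical.

```python
from collections import deque

def bfs(graph, root):
    """Breadth-first search from root; returns (distance, parent) maps."""
    dist, parent = {root: 0}, {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in sorted(graph[u]):
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent

def eccentricity(graph, x):
    """r_G(x): the largest distance from x to any other node."""
    return max(bfs(graph, x)[0].values())

def diameter(graph):
    """d(G): the largest eccentricity."""
    return max(eccentricity(graph, x) for x in graph)

def broadcast_tree(graph):
    """Step 1: pick a center c. Step 2: build BFT(c, G)."""
    center = min(graph, key=lambda x: eccentricity(graph, x))
    _, parent = bfs(graph, center)
    tree = {x: set() for x in graph}
    for v, p in parent.items():
        if p is not None:
            tree[v].add(p)
            tree[p].add(v)
    return center, tree


# Hypothetical example: a ring of six nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
center, tree = broadcast_tree(ring)
print(center, diameter(ring), diameter(tree))
```

Every node of BFT(c, G) is within distance r(G) of c, so any two nodes of the tree are at most 2r(G) ≤ 2d(G) apart; this is the bound the example prints can be compared against.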


To be implemented, this strategy requires that we solve two problems: Center Finding and Breadth-First Spanning-Tree Construction. These problems, as we will see, are not simple to solve efficiently; we will examine them in later chapters.

2.5.4 Application: Better Traversal

In Section 2.4, we have discussed the general strategy Use-a-Tree for problem solving. Now that we know how to construct a spanning tree (using a single initiator), let us apply the strategy to a known problem. Consider again the traversal problem. Using the Use-a-Tree strategy, we can produce an efficient traversal protocol that is much simpler than all the algorithms we have considered before:

Protocol SmartTraversal:
1. Construct, using Shout+, a spanning tree T rooted in the initiator.
2. Perform a traversal of T, using DF Traversal.

The number of messages of SmartTraversal is easy to compute: Shout+ uses 2m messages (Equation 2.17), while DF Traversal on a tree uses exactly 2(n − 1) messages (Equation 2.13). In other words,

M[SmartTraversal] = 2(m + n − 1).    (2.22)

The problem with DF Traversal was its time complexity: It was to reduce time that we developed more complex protocols. How about the time costs of this simple new protocol? The ideal time of Shout+ is at most d + 1 (Equation 2.16). The ideal time of DF Traversal in a tree is 2(n − 1). Hence the total is

T[SmartTraversal] ≤ 2n + d − 1.    (2.23)

In other words, SmartTraversal not only is simple but also has optimal time and message complexity.

2.5.5 Spanning-Tree Construction with Multiple Initiators

We have started examining the spanning-tree construction problem in Section 2.5 assuming that there is a unique initiator. This is unfortunately a very strong (and “unnatural”) assumption to make, as well as difficult and expensive to guarantee. What happens to the single-initiator protocols Shout and df-SPT if there is more than one initiator?

Let us examine first protocol Shout. Consider the very simple case (depicted in Fig. 2.13) of three entities, x, y, and z, connected to each other. Let both x and y be initiators and start the protocol, and let the Q message from x to z arrive there before the one sent by y.

FIGURE 2.13: With multiple initiators, Shout creates a forest.

In this case, neither the link (x, y) nor the link (y, z) will be included in the tree; hence, the algorithm creates not a spanning tree but a spanning forest, which is not connected.

Consider now protocol df-SPT, discussed in Section 2.5.2. Let us examine its execution in the simple network depicted in Figure 2.14, composed of a chain of four nodes x, y, z, and w. Let y and z both be initiators, and let them start the traversal by sending the T message to x and w, respectively. Also in this case, the algorithm will create a disconnected spanning forest of the graph. It is easy to verify that the same situation will occur also with the optimized versions (DF+ and DF*) of the protocol (Exercise 2.9.30).

The failure of these algorithms is not surprising, as they were developed specifically for the restricted environment of a Unique Initiator. Removing the restriction brings out the true nature of the problem, which, as we will now see, has a formidable obstacle.

2.5.6 Impossibility Result

Our goal is to design a spanning-tree protocol that works solely under the standard assumptions and thus is independent of the number of initiators. Unfortunately, any design effort to this end is destined to fail. In fact

Theorem 2.5.6 The SPT problem is deterministically unsolvable under R.

Deterministically unsolvable means that there is no deterministic protocol that always correctly terminates within finite time.

FIGURE 2.14: With multiple initiators, df-SPT creates a forest.

Proof. To see why this is the case, consider the simple system composed of three entities x, y, and z connected by links labeled as shown in Figure 2.15. Let the three entities have identical initial values (the symbols x, y, z are used only for description purposes). If a solution protocol A exists, it must work under any conditions of message delays (as long as they are finite) and regardless of the number of initiators. Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let all three entities start the execution of A simultaneously. Since they are in identical states (same initial status and values, same port labels), they will execute the same rule, obtain the same results (thus, continuing to have the same local values), compose and send (if any) the same messages, and enter the same (possibly new) status. In other words, by Property 1.6.2, they will remain in identical states. In the next time unit, all sent messages (if any) will arrive and be processed. If one entity receives a message, the others will receive the same message at the same time, perform the same local computation, compose and send (if any) the same messages, and enter the same (possibly new) status. And so on. In other words, the entities will continue to be in identical states.

FIGURE 2.15: Proof of Theorem 2.5.6. (Three entities x, y, z connected in a triangle; at each entity the two ports are labeled 1 and 2.)

If A is a solution protocol, it must terminate within finite time. A spanning tree of our simple system is obtained by removing one of the three links, let us say (x, y). In this case, Tree-neighbors will be the port label 2 for entity x and the port label 1 for entity y; instead, z has in Tree-neighbors both port numbers. In other words, when they all terminate, they have distinct values for their local variable Tree-neighbors. But this is impossible, since we just said that the states of the entities are always identical. Thus, no such solution algorithm A exists. ∎

A consequence of this very negative result is that, to construct a spanning tree without constraints on the number of initiators, we need to impose additional restrictions. To determine the “minimal” restrictions that, added to R, will enable us to solve SPT is an interesting research problem still open. The restriction that is commonly used is a very powerful one, Initial Distinct Values, and we will discuss it next.

2.5.7 SPT with Initial Distinct Values

The impossibility result we just witnessed implies that, to solve the SPT problem, we need an additional restriction. The one commonly used is Initial Distinct Values (ID): Each entity has a distinct initial value. Distinct initial values are sometimes called identifiers or ids or global names.
We will now examine some ways in which SPT can be solved under IR = R ∪ {ID}.

Multiple Spanning Trees

As in most software design situations, once we have a solution for a problem and are faced with a more general one, one approach is to try to find ways to reuse and reapply the already existing solution. The solutions we already have are unique-initiator ones and, as we know, they fail in the presence of multiple initiators. Let us see how we can mend their shortcomings using distinct values.

Consider the execution of Shout in the example of Figure 2.13. In this case, the reason why the protocol fails is that the entities do not realize that there are two different requests (e.g., when x receives Q from y) for spanning-tree construction. But we can now use the entities’ ids to distinguish between requests originating from different initiators.

The simplest and most immediate application of this approach is to have each initiator construct “its own” spanning tree with a single-initiator protocol and to use the ids of the initiators to distinguish among different constructions. So, instead of cooperating to construct a single spanning tree, we will have several spanning trees concurrently and independently built. This implies that all the protocol messages (e.g., Q and Yes in Shout+) must contain also the id of the initiator. It also requires additional variables and bookkeeping; for example, at each entity, there will be several instances of the variable Tree-neighbors, one for each spanning tree being constructed (i.e., one for each initiator). Furthermore, each entity will be in possibly different status values for each of these independent SPT-constructions. Recall that the number k of initiators is not known a priori and can change at every execution.

The message cost of this approach depends solely on the number of initiators and on the type of unique-initiator protocol used. But it is in any case very expensive. In fact, if we employ the most efficient SPT-construction protocol we know, Shout+, we will use 2mk messages, which could be as bad as O(n³).

Selective Construction

The large message cost derives from the fact that we construct not one but k spanning trees. Since our goal is just to construct one, there is clearly a needless amount of communication and computation being performed. A better approach consists of letting every initiator start the construction of its own uniquely identified spanning tree (as before), but then suppressing some of these constructions, allowing only one to complete. In this approach, an entity faced with two different SPT-constructions will select and act on only one, “killing” the other; the entity continues this selection process as long as it receives conflicting requests. The criterion an entity uses to decide which SPT-construction to follow and which one to terminate must be chosen very carefully. In fact, the danger is to “kill” all constructions.
The criterion commonly used is based on min-id: Since each SPT-construction has a unique id (that of its initiator), when faced with different SPT-constructions, an entity will choose the one with the smallest id and terminate all the others. (An alternative criterion would be the one based on max-id.)

The solution obtained with this approach has some very clear advantages over the previous solution. First of all, each entity is at any time involved in only one SPT-construction; this fact greatly simplifies the internal organization of the protocol (i.e., the set of rules), as well as the local storage and bookkeeping of each entity. Second, upon termination, all entities have a single shared spanning tree for subsequent uses. However, there is still competitive concurrency: An entity involved in one SPT-construction might receive messages from another construction; in our approach, it will make a choice between the two constructions. If the entity chooses the new one, it will give up all the knowledge (variables, etc.) acquired so far and start from scratch. The message cost of this approach depends again on the number of initiators and on the unique-initiator protocol used.

Consider a protocol developed using this approach, using Shout+ as the basic tool. Informally, an entity u, at any time, participates in the construction of just one spanning tree rooted in some initiator, x. It will ignore all messages referring to the construction of other spanning trees where the initiators have larger ids than x. If instead u receives a message referring to the construction of a spanning tree rooted in an initiator y with an id smaller than x’s, then u will stop working for x and start working for y. As we will see, these techniques will construct a spanning tree rooted in the initiator with the smallest initial value.

IMPORTANT. It is possible that an entity has already terminated its part of the construction of a spanning tree when it receives a message from another initiator (possibly, with a smaller id). In other words, when an entity has terminated a construction, it does not know whether it might have to restart again. Thus, it is necessary to include in the protocol a mechanism that ensures an effective local termination for each entity.

This can be achieved by ensuring that we use, as a building block, a unique-initiator SPT-protocol in which the initiator will know when the spanning tree has been completely constructed (see Exercise 2.9.24). In this way, when the spanning tree rooted in the initiator s with the smallest initial value has been constructed, s will become aware of this fact (as well as that all other constructions, if any, have been “killed”). It can then notify all other entities so that they can enter a terminal status. The notification is just a broadcast; it is appropriate to perform it on the newly constructed spanning tree (so we start taking advantage of its existence).

Protocol MultiShout, depicted in Figures 2.16 and 2.17, uses Shout+ appropriately modified so as to ensure that the root of a constructed tree becomes aware of termination and includes a final broadcast (on the spanning tree) to notify all entities that the task has indeed been completed. We denote by v(x) the id of x; initially all entities are idle and any of them can spontaneously start the algorithm.

Theorem 2.5.7 Protocol MultiShout constructs a spanning tree rooted in the initiator with the smallest initial value.

Proof. Let s be the initiator with the smallest initial value.
Focus on an initiator x = s; its initial execution of the protocol will start the construction of a spanning tree Tx rooted in x. We will ﬁrst show that the construction of Tx will not be completed. To see this, observe that Tx must include every node, including s; but when s receives a message relating to the construction of somebody’s else tree (such as Tx ), it will ignore it, killing the construction of that tree. Let us now show that Ts will instead be constructed. Since the id of s is smaller than all other ids, no entity will ignore the messages related to the construction of Ts started by s; thus, the construction will be completed. 䊏 Let us now consider the message costs of protocol MultiShout. It is clearly more efﬁcient than protocols obtained with the previous approach. However, in the worst case, it is not much better in order of magnitude. In fact, it can be as bad as O(n3 ). Consider for example the graph, shown in Figure 2.18, where n − k of the nodes are fully connected among themselves (the subgraph Kn−k ), and each of the other


BASIC PROBLEMS AND PROTOCOLS

PROTOCOL MultiShout

Status: S = {IDLE, ACTIVE, DONE}; S_INIT = {IDLE}; S_TERM = {DONE}.
Restrictions: R; ID.

IDLE
   Spontaneously
   begin
      root := true;
      root_id := v(x);
      Tree_neighbors := ∅;
      send(Q, root_id) to N(x);
      counter := 0;
      check_counter := 0;
      become ACTIVE;
   end

   Receiving(Q, id)
   begin
      CONSTRUCT;
   end

ACTIVE
   Receiving(Q, id)
   begin
      if root_id = id then
         counter := counter + 1;
         if counter = |N(x)| then
            done := true;
            CHECK;
         endif
      else if root_id > id then
         CONSTRUCT;
      endif
   end

   Receiving(Yes, id)
   begin
      if root_id = id then
         Tree_neighbors := Tree_neighbors ∪ {sender};
         counter := counter + 1;
         if counter = |N(x)| then
            done := true;
            CHECK;
         endif
      endif
   end

   Receiving(Check, id)
   begin
      if root_id = id then
         check_counter := check_counter + 1;
         if (done ∧ check_counter = |Children|) then
            TERM;
         endif
      endif
   end

   Receiving(Terminate)
   begin
      send(Terminate) to Children;
      become DONE;
   end

FIGURE 2.16: Protocol MultiShout

CONSTRUCTING A SPANNING TREE

Procedure CONSTRUCT
begin
   root := false;
   root_id := id;
   Tree_neighbors := {sender};
   parent := sender;
   send(Yes, root_id) to {sender};
   counter := 1;
   check_counter := 0;
   if counter = |N(x)| then
      done := true;
      CHECK;
   else
      send(Q, root_id) to N(x) − {sender};
   endif
   become ACTIVE;
end

Procedure CHECK
begin
   Children := Tree_neighbors − {parent};
   if Children = ∅ then
      send(Check, root_id) to parent;
   endif
end

Procedure TERM
begin
   if root then
      send(Terminate) to Tree_neighbors;
      become DONE;
   else
      send(Check, root_id) to parent;
   endif
end

FIGURE 2.17: Routines of MultiShout

FIGURE 2.18: The execution of MultiShout can cost O(k(n − k)^2) messages. (Diagram: external nodes x_1, x_2, ..., x_k, each linked to a node of the complete subgraph K_{n−k}.)


k (nodes x_1, x_2, ..., x_k) is connected only to a node in K_{n−k}. Suppose that these k “external” nodes are the initiators and that v(x_1) > v(x_2) > · · · > v(x_k). Consider now an execution where the Q messages from the external entities arrive at K_{n−k} in order, according to the indices (i.e., the one from x_1 arrives first). When the Q message from x_1 arrives at K_{n−k}, it will trigger the SPT-construction there. Notice that the Shout+ component of our protocol with a unique initiator will use O((n − k)^2) messages inside the subgraph K_{n−k}. Assume that the entire computation inside K_{n−k} triggered by x_1 is practically completed (costing O((n − k)^2) messages) by the time the Q message from x_2 arrives at K_{n−k}. Since v(x_1) > v(x_2), all the work done in K_{n−k} has been wasted and every entity there must start the construction of the spanning tree rooted in x_2. In the same way, assume that the time delays are such that the Q message from x_i arrives at K_{n−k} only when the computation inside K_{n−k} triggered by x_{i−1} is practically completed (costing O((n − k)^2) messages). Then, in this case (which is possible), work costing O((n − k)^2) messages will be repeated k times, for a total of O(k(n − k)^2) messages. If k is a linear fraction of n (e.g., k = n/2), then the cost will be O(n^3).

The fact that this solution is not very efficient does not imply that the approach of selective construction it uses is not effective. On the contrary, it can be made efficient at the expense of simplicity. We will examine it in great detail later in the book when studying the leader election problem.
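As a worked illustration of the bound above (ignoring the constants hidden by the O-notation, which is an assumption made only for this sketch), the expression k(n − k)^2 can be evaluated at k = n/2 and at the value of k that maximizes it:

```latex
k(n-k)^2 \Big|_{k = n/2} = \frac{n}{2}\cdot\left(\frac{n}{2}\right)^2 = \frac{n^3}{8},
\qquad
\max_{0 \le k \le n} k(n-k)^2 = \frac{4n^3}{27} \quad \text{at } k = \frac{n}{3}.
```

The maximum follows from setting the derivative (n − k)(n − 3k) to zero; either way, the repeated work is cubic in n whenever k is a constant fraction of n.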

2.6 COMPUTATIONS IN TREES

In this section, we consider computations in tree networks under the standard restrictions R plus, clearly, the common knowledge that the network is a tree. Note that the knowledge of being in a tree implies that each entity can determine whether it is a leaf (i.e., it has only one neighbor) or an internal node (i.e., it has more than one neighbor).

We have already seen how to solve the Broadcast, the Wake-Up, and the Traversal problems in a tree network. The first two are optimally solved by protocol Flooding, the latter by protocol DF Traversal. These techniques constitute the first set of algorithmic tools for computing in trees with multiple initiators. We will now introduce another very basic and useful technique, saturation, and show how it can be employed to efficiently solve many different problems in trees regardless of the number of initiators and of their location.

Before doing so, we need to introduce some basic concepts and terminology about trees. In a tree T, the removal of a link (x, y) will disconnect T into two trees, one containing x (but not y), the other containing y (but not x); we shall denote them by T[x − y] and T[y − x], respectively. Let d[x, y] = Max{d(x, z) : z ∈ T[y − x]} be the longest distance between x and the nodes in T[y − x]. Recall that the longest distance between any two nodes is called the diameter, and it is denoted by d. If d(x, y) = d, the path between x and y is said to be diametral.
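The definitions of T[x − y] and d[x, y] can be checked on a small example. The following is a minimal centralized sketch, not a distributed protocol; the adjacency-list representation, the sample tree `TREE`, and the function names are assumptions introduced only for illustration.

```python
from collections import deque

# Hypothetical 7-node tree used only for illustration.
TREE = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
        "d": ["b", "e", "g"], "e": ["d", "f"], "f": ["e"], "g": ["d"]}

def subtree(adj, x, y):
    """Nodes of T[x - y]: the part containing x once link (x, y) is removed."""
    seen, queue = {x}, deque([x])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            # Never traverse the removed link from x to y.
            if w not in seen and (u, w) != (x, y):
                seen.add(w)
                queue.append(w)
    return seen

def distances(adj, src):
    """Hop distances d(src, z) for every node z (breadth-first search)."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def d_bracket(adj, x, y):
    """d[x, y] = Max{d(x, z) : z in T[y - x]} for neighbors x and y."""
    dist = distances(adj, x)
    return max(dist[z] for z in subtree(adj, y, x))

diameter = max(max(distances(TREE, x).values()) for x in TREE)
```

On this sample tree, removing link (b, d) leaves a, b, c on one side and d, e, f, g on the other.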


2.6.1 Saturation: A Basic Technique

The technique, which we shall call Full Saturation, is very simple and can be autonomously and independently started by any number of initiators. It is composed of three stages:

1. the activation stage, started by the initiators, in which all nodes are activated;
2. the saturation stage, started by the leaf nodes, in which a unique couple of neighboring nodes is selected; and
3. the resolution stage, started by the selected pair.

The activation stage is just a wake-up: each initiator sends an activation (i.e., wake-up) message to all its neighbors and becomes active; any noninitiator, upon receiving the activation message from a neighbor, sends it to all its other neighbors and becomes active; active nodes ignore all received activation messages. Within finite time, all nodes become active, including the leaves. The leaves will start the second stage.

Each active leaf starts the saturation stage by sending a message (call it M) to its only neighbor, referred to from now on as its “parent,” and becomes processing. (Note: M messages will start arriving within finite time at the internal nodes.) An internal node waits until it has received an M message from all its neighbors but one, sends an M message to that neighbor, which will now be considered its “parent,” and becomes processing. If a processing node receives a message from its parent, it becomes saturated.

The resolution stage is started by the saturated nodes; the nature of this stage depends on the application. Commonly, this stage is used as a notification for all entities (e.g., to achieve local termination). Since the nature of the final stage will depend on the application, we will only describe the set of rules implementing the first two stages of Full Saturation.

IMPORTANT. A “truncated” protocol like this will be called a “plug-in”. In its execution, not all entities will enter a terminal status.
To transform it into a full protocol, some other action (e.g., the resolution stage) must be performed so that eventually all entities enter a terminal status. It is assumed that initially all entities are in the same status available. Let us now discuss some properties of this basic technique. Lemma 2.6.1 Exactly two processing nodes will become saturated; furthermore, these two nodes are neighbors and are each other’s parent. Proof. From the algorithm, it follows that an entity sends a message M only to its parent and becomes saturated only upon receiving an M message from its parent. Choose an arbitrary node x, and traverse the “up” edge of x (i.e., the edge along which the M message was sent from x to its parent). By moving along “up” edges, we must meet a saturated node s1 since there are no cycles in the graph. This node has become saturated when receiving an M message from its parent s2 . Since s2


PLUG-IN Full Saturation

Status: S = {AVAILABLE, ACTIVE, PROCESSING, SATURATED}; S_INIT = {AVAILABLE};
Restrictions: R ∪ T.

AVAILABLE
   Spontaneously
   begin
      send(Activate) to N(x);
      Initialize;
      Neighbors := N(x);
      if |Neighbors| = 1 then
         Prepare Message;
         parent ⇐ Neighbors;
         send(M) to parent;
         become PROCESSING;
      else
         become ACTIVE;
      endif
   end

   Receiving(Activate)
   begin
      send(Activate) to N(x) − {sender};
      Initialize;
      Neighbors := N(x);
      if |Neighbors| = 1 then
         Prepare Message;
         parent ⇐ Neighbors;
         send(M) to parent;
         become PROCESSING;
      else
         become ACTIVE;
      endif
   end

ACTIVE
   Receiving(M)
   begin
      Process Message;
      Neighbors := Neighbors − {sender};
      if |Neighbors| = 1 then
         Prepare Message;
         parent ⇐ Neighbors;
         send(M) to parent;
         become PROCESSING;
      endif
   end

PROCESSING
   Receiving(M)
   begin
      Process Message;
      Resolve;
   end

FIGURE 2.19: Full Saturation


Procedure Initialize
begin
   nil;
end

Procedure Prepare Message
begin
   M := ("Saturation");
end

Procedure Process Message
begin
   nil;
end

Procedure Resolve
begin
   become SATURATED;
   Start Resolution stage;
end

FIGURE 2.20: Procedures used by Full Saturation

has sent an M message to s1, this implies that s2 must have been processing and must have considered s1 its parent; thus, when the M message from s1 arrives at s2, s2 will become saturated also. Thus, there exist at least two nodes that become saturated; furthermore, these two nodes are each other’s parent. Assume that there are more than two saturated nodes; then there exist two saturated nodes, x and y, such that d(x, y) ≥ 2. Consider a node z on the path from x to y; z could not send an M message toward both x and y; therefore, one of the nodes cannot be saturated. Therefore, the lemma holds. ∎

IMPORTANT. Which entities become saturated depends on the communication delays and is therefore totally unpredictable. Subsequent executions with the same initiators might generate different results. In fact, any pair of neighbors could become saturated. The only guarantee is that a pair of neighbors will be selected; since a pair of neighbors uniquely identifies an edge (the one connecting them), this result is also called edge election.

To determine the number of message exchanges, observe that the activation stage is a wake-up in a tree and hence it will use n + k − 2 messages (Equation 2.5), where k denotes the number of initiators. During the saturation stage, exactly one message is transmitted on each edge, except the edge connecting the two saturated nodes, on which two M messages are transmitted, for a total of n − 1 + 1 = n messages. Thus,

M[Full Saturation] = 2n + k − 2.    (2.24)
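Lemma 2.6.1 and the message count of Equation 2.24 can be observed in a small simulation. The following is a minimal sketch of the activation and saturation stages under assumptions that are not part of the protocol description: all initiators start at the same instant, messages on the same link are delivered in FIFO order, and arbitrary delays are modeled by picking a random link at each step. The sample tree `TREE` and all function names are hypothetical.

```python
import random

# Hypothetical 7-node tree used only for illustration.
TREE = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
        "d": ["b", "e", "g"], "e": ["d", "f"], "f": ["e"], "g": ["d"]}

def full_saturation(adj, initiators, seed=0):
    """Simulate activation + saturation; return (saturated, parent, sent)."""
    rng = random.Random(seed)
    status = {x: "AVAILABLE" for x in adj}
    pending, parent = {}, {}          # pending[x]: neighbors not yet heard from
    in_transit = []                   # (sender, receiver, kind), oldest first
    sent = {"Activate": 0, "M": 0}

    def send(x, y, kind):
        sent[kind] += 1
        in_transit.append((x, y, kind))

    def try_saturation(x):
        # Send M once all neighbors but one have been heard from.
        if len(pending[x]) == 1:
            (parent[x],) = pending[x]
            send(x, parent[x], "M")
            status[x] = "PROCESSING"
        else:
            status[x] = "ACTIVE"

    def wake(x, sender=None):
        for y in adj[x]:
            if y != sender:
                send(x, y, "Activate")
        pending[x] = set(adj[x])
        try_saturation(x)

    for x in initiators:              # assumption: simultaneous start
        wake(x)
    while in_transit:
        s, r, _ = rng.choice(in_transit)          # random link = random delay
        # Assumption: FIFO per link, so deliver the oldest message from s to r.
        i = next(i for i, m in enumerate(in_transit) if m[:2] == (s, r))
        sender, x, kind = in_transit.pop(i)
        if kind == "Activate":
            if status[x] == "AVAILABLE":
                wake(x, sender)       # active nodes ignore Activate
        elif status[x] == "ACTIVE":
            pending[x].discard(sender)
            try_saturation(x)
        elif status[x] == "PROCESSING":
            status[x] = "SATURATED"   # M received from the parent
    saturated = sorted(x for x in adj if status[x] == "SATURATED")
    return saturated, parent, sent
```

Running it with different seeds typically selects different saturated pairs, while the two counters stay at n + k − 2 and n.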


Notice that only n of those messages are due to the saturation stage.

To determine the ideal time complexity, let I ⊆ V denote the set of initiator nodes, let L ⊆ V denote the set of leaf nodes, and let t(x) denote the time delay, from the initiation of the algorithm, until node x becomes active. To become saturated, node s must have waited until all the leaves have become active and the M messages originated from them have reached s; that is, it must have waited Max{t(l) + d(l, s) : l ∈ L}. To become active, a noninitiator node x must have waited for an “Activation” message to reach it, while there is no additional waiting time for an initiator node; thus, t(x) = Min{d(x, y) + t(y) : y ∈ I}. Therefore, the total delay, from the initiation of the algorithm, until s becomes saturated (and, thus, the ideal execution delay of the algorithm) is

T[Full Saturation] = Max{Min{d(l, y) + t(y) : y ∈ I} + d(l, s) : l ∈ L}.    (2.25)

We will now discuss how to apply the saturation technique to solve different problems.

2.6.2 Minimum Finding

Let us see how the saturation technique can be used to compute the smallest among a set of values distributed among the nodes of the network. Every entity x has an input value v(x) and is initially in the same status; the task is to determine the minimum among those input values. That is, in the end, each entity must know whether or not its value is the smallest and enter the appropriate status, minimum or large, respectively.

IMPORTANT. Notice that these values are not necessarily distinct. So, more than one entity can have the minimum value; all of them must become minimum.

This problem is called Minimum Finding (MinFind) and is the simplest among the class of Distributed Query Processing problems that we will examine in later chapters: a set of data (e.g., a file) is distributed among the sites of a communication network; queries (i.e., external requests for information about the set) can arrive at any time at any site (which becomes an initiator of the processing), triggering computation and communication activities. A stronger version of this problem requires all entities to know the minimum value when they enter the final status.

Let us see how to solve this problem in a tree network. If the tree were rooted, this task could be trivially performed. In fact, in a rooted tree not only is there a special node, the root, but also a logical orientation of the links: “up” toward the root and “down” away from the root; this corresponds to the “parent” and “children” relationship, respectively. In a rooted tree, to find the minimum, the root would broadcast down the request to compute the minimum value; exploiting the orientation of the links, the entities would then perform a convergecast (described in more detail in Section 2.6.7): starting from the leaves, the nodes determine the smallest value among the values “down” and send it “up.” As a result of this process, the minimum value is determined at the root, which will then broadcast it to all nodes.


PROCESSING
   Receiving(Notification)
   begin
      send(Notification) to N(x) − {parent};
      if v(x) = Received_Value then
         become MINIMUM;
      else
         become LARGE;
      endif
   end

Procedure Initialize
begin
   min := v(x);
end

Procedure Prepare Message
begin
   M := ("Saturation", min);
end

Procedure Process Message
begin
   min := MIN{min, Received_Value};
end

Procedure Resolve
begin
   Notification := ("Resolution", min);
   send(Notification) to N(x) − {parent};
   if v(x) = min then
      become MINIMUM;
   else
      become LARGE;
   endif
end

FIGURE 2.21: New Rule and Procedures used for Minimum Finding

Notice that convergecast can be used only in rooted trees. The existence of a root (and the additional information existing in a rooted tree) is, however, a very strong assumption; in fact, it is equivalent to assuming the existence of a leader (which, as we will see, might not be computable).

Full Saturation allows us to achieve the same goals without a root or any additional information. This is achieved simply by including in the M message the smallest value known to the sender. Namely, in the saturation stage the leaves will send their value with the M message, and each internal node sends the smallest among its own value and all the received ones. In other words, MinF-Tree is just protocol Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.21 and where the resolution stage is just a notification, started by the two saturated nodes, of the minimum value they have computed. This is obtained by simply modifying procedure Resolve accordingly and adding the rule for handling the reception of the notification.


The correctness follows from the fact that both saturated nodes know the minimum value (Exercise 2.9.31). The number of message transmissions for the minimum-finding algorithm MinF-Tree will be exactly the same as the one experienced by Full Saturation plus the ones performed during the notification. Since a notification message is sent on every link except the one connecting the two saturated nodes, there will be exactly n − 2 such messages. Hence

M[MinF-Tree] = 3n + k − 4.    (2.26)

The time costs will be the ones experienced by Full Saturation plus the ones required by the notification. Let Sat denote the set of the two saturated nodes; then

T[MinF-Tree] = T[Full Saturation] + Max{d(s, x) : s ∈ Sat, x ∈ V}.    (2.27)
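The saturation and notification stages of MinF-Tree can be sketched as follows. This is a simplified simulation under stated assumptions: the activation stage is omitted (all nodes are assumed already active), message delays are modeled by delivering a random in-transit message at each step, and the sample tree, the input values, and the function names are hypothetical.

```python
import random

# Hypothetical tree and input values used only for illustration.
TREE = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
        "d": ["b", "e", "g"], "e": ["d", "f"], "f": ["e"], "g": ["d"]}
VALUE = {"a": 5, "b": 3, "c": 9, "d": 3, "e": 7, "f": 4, "g": 8}

def minf_tree(adj, value, seed=0):
    """Return (status, sent) after the saturation and notification stages."""
    rng = random.Random(seed)
    low = dict(value)                         # Initialize: min := v(x)
    pending = {x: set(adj[x]) for x in adj}   # neighbors not yet heard from
    parent, status = {}, {}
    in_transit, sent = [], 0

    def send(x, y, kind, val):
        nonlocal sent
        sent += 1
        in_transit.append((x, y, kind, val))

    def decide(x, minimum):
        # Enter the final status and forward the notification away from parent.
        status[x] = "MINIMUM" if value[x] == minimum else "LARGE"
        for y in adj[x]:
            if y != parent[x]:
                send(x, y, "Notification", minimum)

    def try_saturation(x):
        if len(pending[x]) == 1:
            (parent[x],) = pending[x]
            send(x, parent[x], "M", low[x])   # Prepare Message

    for x in adj:
        try_saturation(x)                     # only the leaves send here
    while in_transit:
        sender, x, kind, val = in_transit.pop(rng.randrange(len(in_transit)))
        if kind == "Notification":
            decide(x, val)
        else:
            low[x] = min(low[x], val)         # Process Message
            if x in parent:                   # processing: M came from parent
                decide(x, low[x])             # Resolve
            else:
                pending[x].discard(sender)
                try_saturation(x)
    return status, sent
```

With the activation stage left out, the expected count is n messages for saturation plus n − 2 for notification, which matches Equation 2.26 once the n + k − 2 activation messages are added.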

2.6.3 Distributed Function Evaluation

An important class of problems are those of Distributed Function Evaluation; that is, where the task is to compute a function whose arguments are distributed among the processors of a distributed memory system (e.g., the sites of a network). An instance of this problem is the one we just solved: Minimum Finding. We will now discuss how the saturation technique can be used to evaluate a large class of functions.

Semigroup Operations

Let f be an associative and commutative function defined over all subsets of the input values. Examples of this type of function are minimum, maximum, sum, product, and so forth, as well as logical predicates. Because of their algebraic properties, these functions are called semigroup operations.

IMPORTANT. It is possible that some entities do not have an argument (i.e., initial value) or that the function must only be evaluated on a subset of the arguments. We shall denote the fact that x does not have an argument by v(x) = nil.

The same approach that has led us to solve Minimum Finding can be used to evaluate f. The protocol Function Tree is just protocol Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.22 and where the resolution stage is just a notification, started by the two saturated nodes, of the final result of the function they have computed. This is obtained by simply modifying procedure Resolve accordingly and adding the rule for handling the reception of the notification. The correctness follows from the fact that both saturated nodes know the result of the function (Exercise 2.9.32). For particular types of functions, see Exercises 2.9.33, 2.9.34, and 2.9.35.


PROCESSING
   Receiving(Notification)
   begin
      result := Received_Value;
      send(Notification) to N(x) − {parent};
      become DONE;
   end

Procedure Initialize
begin
   if v(x) ≠ nil then
      result := f(v(x));
   else
      result := nil;
   endif
end

Procedure Prepare Message
begin
   M := ("Saturation", result);
end

Procedure Process Message
begin
   if Received_Value ≠ nil then
      if result ≠ nil then
         result := f(result, Received_Value);
      else
         result := f(Received_Value);
      endif
   endif
end

Procedure Resolve
begin
   Notification := ("Resolution", result);
   send(Notification) to N(x) − {parent};
   become DONE;
end

FIGURE 2.22: New Rule and Procedures used for Function Tree
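The nil handling in the procedures of Figure 2.22 can be illustrated in isolation. In this minimal sketch, nil is represented by Python's `None` and f is assumed to be a function that accepts a list of values (as the built-ins `min`, `max`, and `sum` do); the helper `fold` is hypothetical and simply applies the two procedures along an arbitrary sequence of partial results rather than along a tree.

```python
def initialize(v, f):
    """Procedure Initialize: start from f(v(x)), or nil if x has no argument."""
    return f([v]) if v is not None else None

def process_message(result, received, f):
    """Procedure Process Message: merge a received partial result."""
    if received is None:
        return result                      # nothing to merge
    if result is not None:
        return f([result, received])
    return f([received])

def fold(values, f):
    """Hypothetical helper: merge the partial results of several entities."""
    result = None
    for v in values:
        result = process_message(result, initialize(v, f), f)
    return result
```

Because f is associative and commutative, the order in which partial results are merged does not affect the outcome, which is why the unpredictable choice of saturated nodes is harmless.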

The time and message costs of the protocol are exactly the same as the ones for Minimum Finding. Thus, semigroup operations can be performed optimally on a tree with any number of initiators and without a root or additional information.

Cardinal Statistics

A useful class of functions are statistical ones, such as average, standard deviation, and so forth. These functions are not semigroup operations but can nevertheless be optimally solved using the saturation technique. We will just examine, as an example, the computation of Ave, the average of the (relevant) entities’ values. Observe that Ave ≡ Sum / Size, where Sum is the sum of all (relevant) values, and Size is the number of those values. Since Sum is a semigroup operation, we already know how to compute it. Also Size is trivially computed using saturation (Exercises 2.9.36 and 2.9.37).


We can collect Sum and Size at the two saturated nodes with a single execution of Saturation: the M message will contain two data fields, M = (“Saturation”, sum, size), which are initialized by each leaf node and updated by the internal ones. The resolution stage is just a notification, started by the two saturated nodes, of the average they have computed. Similarly, a single execution of Full Saturation with a final notification of the result will allow the entities to compute cardinal statistics on the input values. Notice that ordinal statistics (e.g., median) are in general more difficult to resolve. We will discuss them in the chapter on selection and sorting of distributed data.

2.6.4 Finding Eccentricities

The basic technique has been so far used to solve single-valued problems; that is, problems whose solution requires the identification of a single value. It can also be used to solve multi-valued problems such as the problem of determining the eccentricities of all the nodes.

PROCESSING
   Receiving(Notification)
   begin
      result := Received_Value;
      send(Notification) to N(x) − {parent};
      become DONE;
   end

Procedure Initialize
begin
   sum := v(x);
   size := 1;
end

Procedure Prepare Message
begin
   M := ("Saturation", sum, size);
end

Procedure Process Message
begin
   sum := sum + Received_sum;
   size := size + Received_size;
end

Procedure Resolve
begin
   result := sum / size;
   Notification := ("Resolution", result);
   send(Notification) to N(x) − {parent};
   become DONE;
end

FIGURE 2.23: New Rule and Procedures used for computing the Average
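The procedures of Figure 2.23 carry the pair (sum, size) rather than a running average. The following minimal sketch shows why: merging pairs gives the correct average, whereas averaging partial averages generally does not. The function names and the sample values are assumptions made only for this illustration.

```python
def initialize(v):
    """Procedure Initialize: sum := v(x); size := 1."""
    return (v, 1)

def process_message(state, received):
    """Procedure Process Message: add the received sum and size."""
    return (state[0] + received[0], state[1] + received[1])

def resolve(state):
    """Procedure Resolve: result := sum / size."""
    return state[0] / state[1]

# Merge the pairs of three hypothetical entities holding 2, 4, and 9.
state = initialize(2)
for v in (4, 9):
    state = process_message(state, initialize(v))
average = resolve(state)

# Averaging partial averages instead would weight the values incorrectly.
naive = ((2 + 4) / 2 + 9) / 2
```

Here the merged pair is (15, 3), so the average is 5.0, while the naive average of averages gives 6.0.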


The eccentricity of a node x, denoted by r(x), is the largest distance between x and any other node in the tree: r(x) = Max{d(x, y) : y ∈ V}; note that a center is a node with the smallest eccentricity. (We briefly discussed center and eccentricity already in Section 2.5.3.)

To compute its own eccentricity, a node x needs to determine the maximum distance from all other nodes in the tree. To accomplish this, x needs just to broadcast the request, making itself the root of the tree, and, using convergecast on this rooted tree, collect the maximum distance to itself. This approach would require 2(n − 1) messages and it is clearly optimal with respect to order of magnitude. If we want every entity to compute its eccentricity, this however would lead to a solution that requires 2(n^2 − n) messages. We will now show that saturation will yield instead an O(n), and thus optimal, solution.

The first step is to use saturation to compute the eccentricity of the two saturated nodes. Notice that we do not know a priori which pair of neighbors will become saturated. We can nevertheless ensure that when they become saturated they will know their eccentricity. To do so, it is enough to include, in the M message sent by an entity x to its neighbor y, the maximum distance from x to the nodes in T[x − y], increased by 1. In this way, a saturated node s will know d[s, y] for each neighbor y; thus, it can determine its eccentricity (Exercise 2.9.38).

Our goal is to have all nodes determine their eccentricity, not just the saturated ones. The interesting thing is that the information available at each entity at the end of the saturation stage is almost sufficient to make them compute their own eccentricity. Consider an entity u; it sent the M message to its parent v after it received one from all its other neighbors; the message from each neighbor y ≠ v contained d[u, y]. In other words, u already knows the maximum distance from all the entities except the ones in the tree T[v − u]. Thus, the only information u is missing is d[u, v] = Max{d(u, y) : y ∈ T[v − u]}. Notice that (Exercise 2.9.39)

d[u, v] = Max{d(u, y) : y ∈ T[v − u]} = 1 + Max{d[v, z] : z ∈ N(v), z ≠ u}.    (2.28)

Summarizing, every node, except the saturated ones, is missing one piece of information: the maximum distance from the nodes on the other side of the link connecting it to its parent. If the parents could provide this information, the task could be completed. Unfortunately, the parents are also missing the information, unless they are the saturated nodes.

The saturated nodes have all the information they need. They also have the information their neighbors are missing: let s be a saturated node and x be an unsaturated neighbor; x is missing the information d[x, s]; by Equation 2.28, this is exactly d[x, s] = 1 + Max{d[s, z] : z ∈ N(s), z ≠ x}, and s knows all the d[s, z] (they were included in the M messages it received). So, the saturated nodes s can provide the needed information to their neighbors, who can then compute their eccentricity. The nice property is that now these neighbors have the information required by their own neighbors (further away from the saturated nodes). Thus, the resolution stage of Full


PROCESSING
   Receiving("Resolution", dist)
   begin
      Resolve;
   end

Procedure Initialize
begin
   Distance[x] := 0;
end

Procedure Prepare Message
begin
   maxdist := 1 + Max{Distance[*]};
   M := ("Saturation", maxdist);
end

Procedure Resolve
begin
   Process Message;
   Calculate Eccentricity;
   forall y ∈ N(x) − {parent} do
      maxdist := 1 + Max{Distance[z] : z ∈ N(x) − {y}};
      send("Resolution", maxdist) to y;
   endfor
   become DONE;
end

Procedure Process Message
begin
   Distance[sender] := Received_distance;
end

Procedure Calculate Eccentricity
begin
   r(x) := Max{Distance[z] : z ∈ N(x)};
end

FIGURE 2.24: New Rule and Procedures used for computing the Eccentricities

Saturation can be used to provide the missing information: starting from the saturated nodes, once an entity receives the missing information from a neighbor, it will compute its eccentricity and provide the missing information to all its other neighbors. IMPORTANT. Notice that, in the resolution stage, an entity sends different information to each of its neighbors. Thus, unlike the resolution we used so far, it is not a notiﬁcation. The protocol Eccentricities will thus be a Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.24. The rules for handling the reception of the message, the procedure Resolve, and the procedure to calculate the eccentricity are also shown in Figure 2.24. Notice that, even though each node receives a different message in the resolution stage, only one message will be received by each node in that stage, except


the saturated nodes, which will receive none. Thus, the message cost of protocol Eccentricities will be exactly the same as that of MinF-Tree, and so will the time cost:

M[Eccentricities] = 3n + k − 4 ≤ 4n − 4.    (2.29)

T[Eccentricities] = T[MinF-Tree].    (2.30)
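The two-stage computation of the eccentricities can be sketched as follows. This is a simplified emulation under stated assumptions: the activation stage is omitted, messages are delivered in the order they were sent (one possible execution among many), and each resolution message follows Equation 2.28, that is, it takes the maximum over all neighbors of the sender other than the recipient. The sample tree and the function name are hypothetical.

```python
from collections import deque

# Hypothetical 7-node tree used only for illustration.
TREE = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
        "d": ["b", "e", "g"], "e": ["d", "f"], "f": ["e"], "g": ["d"]}

def eccentricities(adj):
    """Return {x: r(x)} computed by saturation followed by resolution."""
    dist = {x: {x: 0} for x in adj}           # Distance[] table of each node
    pending = {x: set(adj[x]) for x in adj}   # neighbors not yet heard from
    parent, ecc = {}, {}
    queue = deque()                           # messages (sender, receiver, maxdist)

    def send_m(x):
        # Prepare Message: largest known distance, increased by 1.
        (parent[x],) = pending[x]
        queue.append((x, parent[x], 1 + max(dist[x].values())))

    for x in adj:
        if len(adj[x]) == 1:                  # the leaves start the saturation
            send_m(x)
    while queue:
        sender, x, maxdist = queue.popleft()
        dist[x][sender] = maxdist             # Process Message
        if x not in parent:                   # still active: saturation stage
            pending[x].discard(sender)
            if len(pending[x]) == 1:
                send_m(x)
        else:                                 # message from the parent: Resolve
            ecc[x] = max(dist[x][z] for z in adj[x])
            for y in adj[x]:
                if y != parent[x]:
                    # Equation 2.28: exclude only the recipient y.
                    others = [dist[x][z] for z in dist[x] if z != y]
                    queue.append((x, y, 1 + max(others)))
    return ecc
```

On the sample tree, the pair (d, e) becomes saturated in this particular schedule, and every other node learns its eccentricity from exactly one resolution message.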

2.6.5 Center Finding

A center is a node from which the maximum distance to all other nodes is minimized. A network might have more than one center. The Center Finding problem (Center) is to make each entity aware of whether or not it is a center by entering the appropriate terminal status center or not-center, respectively.

A Simple Protocol

To solve Center we can use the fact that a center is exactly a node with the smallest eccentricity. Thus a solution protocol consists of finding the minimum among all eccentricities, combining the protocols we have developed so far:

1. Execute protocol Eccentricities;
2. Execute the last two stages (saturation and resolution) of MinF-Tree.

Part (1) will be started by the initiators; part (2) will be started by the leaves once, upon termination of their execution of Eccentricities, they know their eccentricity; the saturation stage of MinF-Tree will determine, at two new saturated nodes, the minimum overall eccentricity, which they will then broadcast in the notification stage. At that time, an entity can determine if it is a center or not. This approach will cost 3n + k − 4 messages for part (1) and n + n − 2 = 2n − 2 for part (2), for a total of 5n + k − 6 ≤ 6n − 6 messages. The time costs are no more than T[Eccentricities] + 2d ≤ 4d.

A Refined Protocol

An improvement can be derived by exploiting the structure of the problem in more detail. Recall that d[x, y] = Max{d(x, z) : z ∈ T[y − x]} is the longest distance between x and the nodes in T[y − x]. Let d1[x] and d2[x] be the largest and second-largest of all {d[x, y] : y ∈ N(x)}, respectively. The centers of a tree have some very interesting properties. Among them:

Lemma 2.6.2 In a tree either there is a unique center or there are two centers and they are neighbors.

Lemma 2.6.3 In a tree all centers lie on all diametral paths.

Lemma 2.6.4 A node x is a center if and only if d1[x] − d2[x] ≤ 1; if strict inequality holds, then x is the only center.


Lemma 2.6.5 Let y and z be neighbors of x such that d1[x] = d[x, y] and d2[x] = d[x, z]. If d[x, y] − d[x, z] > 1, then all centers are in T[y − x].

Lemma 2.6.4 gives us the tool we need to devise a solution protocol: an entity x can determine whether or not it is a center, provided it knows the value d[x, y] for each of its neighbors y. But this is exactly the information that was provided to x by protocol Eccentricities so it could compute r(x). This means that to solve Center it suffices to execute Eccentricities. Once an entity has all the information to compute its eccentricity r(x), it will check whether the largest and the second largest received values differ by at most one; if so, it becomes center, otherwise not-center. Thus, the solution protocol Center Tree is obtained from Eccentricities by adding this test and some bookkeeping (Exercise 2.9.40). The time and message costs of Center Tree will be exactly the same as those of Eccentricities.

M[Center Tree] = 3n + k − 4 ≤ 4n − 4.    (2.31)

T[Center Tree] = T[Eccentricities].    (2.32)
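The test of Lemma 2.6.4 can be checked on small trees. The following is a minimal centralized sketch: the values d[x, y] are computed by brute force instead of being collected through messages, and a leaf is assumed to have second-largest value 0 (its distance to itself). The sample trees and the function names are hypothetical.

```python
from collections import deque

# Hypothetical trees used only for illustration.
TREE = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"],
        "d": ["b", "e", "g"], "e": ["d", "f"], "f": ["e"], "g": ["d"]}
PATH = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}

def branch_depths(adj, x):
    """Return {y: d[x, y]} for every neighbor y of x (brute force)."""
    depth = {}
    for y in adj[x]:
        # Breadth-first search inside T[y - x], measuring distances from x.
        dist, queue = {y: 1}, deque([y])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w != x and w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        depth[y] = max(dist.values())
    return depth

def is_center(adj, x):
    """Lemma 2.6.4: x is a center iff d1[x] - d2[x] <= 1."""
    # Appending 0 supplies d2[x] for a leaf (assumption of this sketch).
    top = sorted(branch_depths(adj, x).values(), reverse=True) + [0]
    return top[0] - top[1] <= 1

centers_tree = [x for x in TREE if is_center(TREE, x)]
centers_path = [x for x in PATH if is_center(PATH, x)]
```

The sample tree has the single center d, while the 4-node path has two neighboring centers, in agreement with Lemma 2.6.2.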

An Efficient Plug-In

The solutions we have discussed are full protocols. In some circumstances, however, a plug-in is sufficient; for example, when the centers must start another global task. In these circumstances, the goal is just for the centers to know that they are centers. In such a case, we can construct a more efficient mechanism, still based on saturation, using the resolution stage in a different way.

The properties expressed by Lemmas 2.6.4 and 2.6.5 give us the tools we need to devise the plug-in. In fact, by Lemma 2.6.4, x can determine whether or not it is a center once it knows the value d[x, y] for each of its neighbors y. Furthermore, if x is not a center, by Lemma 2.6.5, this information is sufficient to determine in which subtree T[y − x] a center resides. Thus, the solution is to collect such values at a node x; determine whether x is a center; and, if not, move toward a center until it is reached.

In order to collect the information needed, we can use the first two stages (Wake-up and Saturation) of protocol Eccentricities. Once a node becomes saturated, it can determine whether it is a center by checking whether the largest and the second largest received values differ by at most one. If it is not a center, it will know that the center(s) must reside in the direction from which the largest value has been received. By keeping track at each node (during the saturation stage) of which neighbor has sent the largest value, the direction of the center can also be determined. Furthermore, a saturated node can decide whether it or its parent is closer to a center. The saturated node, say x, closest to a center will then send a “Center” message, containing the second largest received value increased by one, in the direction of the center. A processing node receiving such a message will, in turn, be able to determine whether it is a center and, if not, the direction toward the center(s).


Once the message arrives at a center c, c will be able to determine whether or not it is the only center (using Lemma 2.6.4); if it is not, it will know which neighbor is the other center and will notify it. The Center Finding plug-in will then be the Full Saturation plug-in with the addition of the “Center” message traveling from the saturated nodes to the centers. In particular, the routines Initialize, Process Message, Prepare Message, Resolve, and the new rules governing the reception of the “Center” messages are shown in Figure 2.25.

PROCESSING
   Receiving("Center", value)
   begin
      Process Message;
      Resolve;
   end

Procedure Initialize
begin
   Max_Value := 0;
   Max2_Value := 0;
end

Procedure Prepare Message
begin
   M := ("Saturation", Max_Value + 1);
end

Procedure Process Message
begin
   if Max_Value < Received_Value then
      Max2_Value := Max_Value;
      Max_Value := Received_Value;
      Max_Neighbor := sender;
   else
      if Max2_Value < Received_Value then
         Max2_Value := Received_Value;
      endif
   endif
end

Procedure Resolve
begin
   if Max_Value − Max2_Value = 1 then
      if Max_Neighbor ≠ parent then
         send("Center", Max2_Value + 1) to Max_Neighbor;
      endif
      become CENTER;
   else
      if Max_Value − Max2_Value > 1 then
         send("Center", Max2_Value + 1) to Max_Neighbor;
      else
         become CENTER;
      endif
   endif
end

FIGURE 2.25: Transforming Saturation into an efﬁcient Plug-In for Center Finding


The message cost of this plug-in is easily determined by observing that, after the Full Saturation plug-in is applied, a message will travel from the saturated node s (closest to a center) to its furthermost center c; hence, d(s, c) additional messages are exchanged. Since d(s, c) ≤ n/2, the total number of message exchanges performed is

M[Center-Finding] ≤ 2.5n + k − 2 ≤ 3.5n − 2.    (2.33)

2.6.6 Other Computations

The simple modifications to the basic technique that we have discussed in the previous sections can be applied to solve a variety of other problems efficiently. Following is a sample of them and the key properties employed toward their solution.

Finding a Median

A median is a node from which the average distance to all nodes in the network is minimized. Since a median obviously minimizes the sum of the distances to all other nodes, it is also called a communication center of the network. In a tree, the key properties are:

Lemma 2.6.6 In a tree either there is a unique median or there are two medians and they are neighbors.

Given a node x and a subtree T, let g[T, x] = Σ_{y∈T} d(x, y) denote the sum of all distances between x and the nodes in T, and let G[x, y] = g[T, x] − g[T, y] = n + 2 − 2 ∗ |T[y − x]|; then

Lemma 2.6.7 Entity x is a median if and only if G[x, y] ≥ 0 for all neighbors y.
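The local test for a median can be checked numerically. The Python sketch below uses its own convention, which is an assumption and need not coincide with the definition of T[y − x] or the sign of G[x, y] used above: `side_size(tree, x, y)` counts the nodes on y's side of link (x, y), and moving from x to its neighbor y changes the sum of distances by n − 2 · side_size. A node is then a median exactly when no neighbor's side holds more than n/2 nodes. All function names are hypothetical.

```python
from typing import Dict, List, Optional

Tree = Dict[str, List[str]]  # adjacency lists of an undirected tree


def side_size(tree: Tree, x: str, y: str) -> int:
    """Number of nodes on y's side of link (x, y): y included, x excluded."""
    return 1 + sum(side_size(tree, y, z) for z in tree[y] if z != x)


def distance_sum(tree: Tree, x: str, parent: Optional[str] = None, depth: int = 0) -> int:
    """Sum of the distances from the starting node to every node of the tree."""
    return depth + sum(distance_sum(tree, y, x, depth + 1) for y in tree[x] if y != parent)


def medians(tree: Tree) -> List[str]:
    """Nodes for which no neighbor's side contains more than half of the nodes."""
    n = len(tree)
    return sorted(x for x in tree if all(2 * side_size(tree, x, y) <= n for y in tree[x]))


path4 = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(medians(path4), [distance_sum(path4, x) for x in "abcd"])
```

On the four-node path the two middle nodes are medians and they are neighbors, in agreement with Lemma 2.6.6.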

Furthermore,

Lemma 2.6.8 If x is not the median, there exists a unique neighbor y such that G[y, x] < 0; such a neighbor lies in the path from x to the median.

Using these properties, it is simple to construct a full protocol as well as an efficient plug-in, following the same approaches used for center finding (Exercise 2.9.41).

Finding Diametral Paths

A diametral path is a path of the longest length. In a network there might be more than one diametral path. The problem we are interested in is to identify all these paths. In distributed terms, this means that each entity needs to know if it is part of a diametral path or not, entering an appropriate status (e.g., on-path or off-path). The key property to solve this problem is

Lemma 2.6.9 A node x is on a diametral path if and only if d1[x] + d2[x] = d.
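Lemma 2.6.9 can be tried on a small example. In the Python sketch below, d1[x] and d2[x] are taken to be the largest and second largest values d[x, y] over distinct neighbors y (0 when there is no such neighbor), and d is obtained as the maximum of d1[x] + d2[x] over all nodes; these readings and the function names are assumptions made for illustration.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]  # adjacency lists of an undirected tree


def branch_depth(tree: Tree, x: str, y: str) -> int:
    """d[x, y]: largest distance from x to a node on y's side of link (x, y)."""
    return 1 + max((branch_depth(tree, y, z) for z in tree[y] if z != x), default=0)


def on_diametral_path(tree: Tree) -> Dict[str, bool]:
    """Status of every node: True for on-path, False for off-path."""
    best_two = {}
    for x in tree:
        depths = sorted((branch_depth(tree, x, y) for y in tree[x]), reverse=True) + [0, 0]
        best_two[x] = depths[0] + depths[1]      # d1[x] + d2[x]
    diameter = max(best_two.values())
    return {x: best_two[x] == diameter for x in tree}


# Path a-b-c-d-e with an extra leaf f attached to c.
tree = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d", "f"],
        "d": ["c", "e"], "e": ["d"], "f": ["c"]}
print(on_diametral_path(tree))
```

Here the only diametral path is a-b-c-d-e, so f is the only node that ends up off-path.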


Thus, a solution strategy will be to determine d, d1[x], and d2[x] at every x and then use Lemma 2.6.9 to decide the final status. A full protocol efficiently implementing this strategy can be designed using the tools developed so far (Exercise 2.9.45).

Consider now designing a plug-in instead of a full protocol; that is, we are only interested in having the entities on diametral paths (and only those) become aware of it. In this case, the other key property is Lemma 2.6.4: every center lies on every diametral path. This gives us a starting point to find the diametral paths: the centers. To continue, we can then use Lemma 2.6.9. In other words, we first find the centers (note: they know the diameter) and then propagate the information along the diametral paths. A center (or, for that matter, a node on a diametral path) does not know a priori which of its neighbors is also on a diametral path. It will thus send the needed information to all its neighbors which, upon receiving it, will determine whether or not they are on such a path; if so, they continue the execution (Exercise 2.9.46).

2.6.7 Computing in Rooted Trees

Rooted Trees

In some cases, the tree T is actually rooted; that is, there is a distinct node, r, called the root, and all links are oriented toward r. In this case, the tree T will be denoted by T[r]. If link (x, y) is oriented from y to x, x is called the parent of y and y is said to be a child of x. Similarly, a descendant of x is any entity z for which there is a directed path from z to x, and an ancestor of x is any entity z for which there is a directed path from x to z. Two important properties of a rooted tree are that the root has no parent, while every other node has only one parent (see Fig. 2.26).

Before examining how to compute in rooted trees, let us first observe the important fact that transforming a tree into a rooted one might be an impossible task.

FIGURE 2.26: (a) A tree T; (b) the same tree rooted in s: T[s].


FIGURE 2.27: It is impossible to transform this tree into a rooted one. (The tree consists of two entities, x and y, joined by a single link labeled 1 at both ends.)

Theorem 2.6.1 The problem of transforming trees into rooted ones is deterministically unsolvable under R.

Proof. Recall that deterministically unsolvable means that there is no deterministic protocol that always correctly terminates within finite time. To see why this is true, consider the simple tree composed of two entities x and y connected by links labeled as shown in Figure 2.27. Let the two entities have identical initial values (the symbols x, y are used only for description purposes). If a solution protocol A exists, it must work under any conditions of message delays (as long as they are finite) and regardless of the number of initiators. Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let both entities start the execution of A simultaneously. Since they are identical (same initial status and values, same port labels), they will execute the same rule, obtain the same results (thus, continuing to have the same local values), compose and send (if any) the same messages, and enter the same (possibly new) status. In other words, they will remain identical. In the next time unit, all sent messages (if any) will arrive and be processed. If one entity receives a message, the other will receive the same message at the same time, perform the same local computation, compose and send (if any) the same messages, and enter the same (possibly new) status. And so on. In other words, the two entities will continue to be identical. If A is a solution protocol, it must terminate within finite time; when this occurs, one entity, say x, becomes the root. But since both entities will always have the same state in this execution, y will also become root, contradicting the fact that A is correct. Thus, no such solution protocol A exists. ∎

This means that being in a rooted tree is considerably different from being in a tree. Let us see how to exploit this difference.
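Before moving on, the lockstep argument in the proof of Theorem 2.6.1 can be made concrete. The Python sketch below runs an arbitrary deterministic rule on two identical entities under a synchronous schedule and records their states after every time unit; `run_lockstep` and `toy_rule` are hypothetical names, and the toy rule stands in for any deterministic protocol.

```python
from typing import Any, Callable, List, Tuple

# A deterministic rule maps (state, received messages) to (new state, messages to send).
Rule = Callable[[Any, List[Any]], Tuple[Any, List[Any]]]


def run_lockstep(rule: Rule, initial_state: Any, rounds: int) -> List[Tuple[Any, Any]]:
    """Execute `rule` at x and y simultaneously with unit delays on their single link."""
    state_x = state_y = initial_state
    inbox_x: List[Any] = []
    inbox_y: List[Any] = []
    trace = []
    for _ in range(rounds):
        state_x, out_x = rule(state_x, inbox_x)
        state_y, out_y = rule(state_y, inbox_y)
        inbox_x, inbox_y = out_y, out_x      # what one sends, the other receives next
        trace.append((state_x, state_y))
    return trace


def toy_rule(state: int, received: List[Any]) -> Tuple[int, List[Any]]:
    """Hypothetical rule: fold the received values into the state and send it on."""
    new_state = (3 * state + sum(received) + 1) % 101
    return new_state, [new_state]


trace = run_lockstep(toy_rule, 7, 20)
print(all(sx == sy for sx, sy in trace))
```

Whatever rule is plugged in, the two states coincide after every time unit, so no execution of this kind can end with exactly one entity as root.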
Convergecast

The orientation of the links in a rooted tree is such that each entity has a notion of “up” (i.e., towards the root) and “down” (i.e., away from the root). If we are in a rooted tree, we can obviously exploit the availability of this globally consistent orientation. In particular, in the saturation technique, the process performed in the saturation stage can be simplified as follows:

Convergecast
1. a leaf sends its message to its parent;
2. each internal node waits until it receives a message from all its children; it then sends a message to its parent.

In this way, the root (which does not have a parent) will be the sole saturated node and will start the resolution stage.
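The two rules above can be simulated in a few lines. The following Python sketch is a sequential simulation under assumed names (`convergecast`, `parent`, `combine`); it folds the values toward the root and counts one message per non-root node.

```python
from typing import Callable, Dict, Optional, Tuple


def convergecast(parent: Dict[str, Optional[str]],
                 value: Dict[str, int],
                 combine: Callable[[int, int], int]) -> Tuple[int, int]:
    """Return (result known at the root, number of messages sent)."""
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)
    waiting = {v: len(children[v]) for v in parent}   # messages still expected
    partial = dict(value)
    ready = [v for v in parent if v != root and waiting[v] == 0]   # rule 1: the leaves
    messages = 0
    while ready:
        v = ready.pop()
        p = parent[v]
        partial[p] = combine(partial[p], partial[v])  # message from v to its parent
        messages += 1
        waiting[p] -= 1
        if p != root and waiting[p] == 0:             # rule 2: heard from all children
            ready.append(p)
    return partial[root], messages


parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
value = {"r": 7, "a": 4, "b": 9, "c": 2, "d": 5}
print(convergecast(parent, value, min))
```

With `min` as the combining function the root learns the minimum value; with addition over all-ones values it learns the number of entities.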


This simplified process is called convergecast. If we are in a rooted tree, we can solve all the problems we discussed in the previous section (minimum finding, center finding, etc.) using convergecast in the saturation stage. In spite of its greater simplicity, the saving in cost due to convergecast is only one message (Exercise 2.9.47). Clearly, such an amount alone does not justify the difference between general trees and rooted ones. There are, however, other advantages in rooted trees, as we will see later.

Totally Ordered Trees

In addition to the globally consistent orientation “up and down,” a rooted tree has another powerful property. In fact, the port numbers at a node are distinct; thus, they can be sorted, for example, in increasing order, and the corresponding links can be ordered accordingly. This means that the entire tree is ordered. As a consequence, the nodes too can be totally ordered, for example, according to a preorder traversal (see Fig. 2.28). Note that a node might not be aware of its order number in the tree, although this information can be easily acquired in the entire tree (Exercise 2.9.49). This means that, in a rooted tree, unique ids can be assigned to the entities. This fact shows indeed the power of rooted trees.

The fact that a rooted tree is totally ordered can be exploited also in other computations. Following are two examples.

Example: Choosing a Random Entity. In many systems and applications, it is necessary to occasionally select an entity at random. This occurs, for instance, in routing systems where, to reduce congestion, a message is first sent to an intermediate destination chosen at random and then delivered from there to the final destination. The same random selection is made, for example, for coordination of a computation, for control of a resource, etc. The problem is how to determine an entity at random. Let us concentrate on uniform choice; that is, every entity must have the same probability, 1/n, of being selected.

FIGURE 2.28: A rooted tree is an ordered tree and unique names can be given to the nodes.
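One way to obtain such names is sketched below in Python, as a sequential illustration under stated assumptions: `children[v]` lists the children of v already sorted by port number, subtree sizes are gathered bottom-up (as a convergecast would do), and names 1..n are then handed out top-down in preorder. The function name `preorder_names` is hypothetical.

```python
from typing import Dict, List


def preorder_names(children: Dict[str, List[str]], root: str) -> Dict[str, int]:
    """Assign the names 1..n to the nodes according to a preorder traversal."""
    size: Dict[str, int] = {}

    def count(v: str) -> int:
        # Bottom-up phase: number of nodes in the subtree rooted in v.
        size[v] = 1 + sum(count(c) for c in children[v])
        return size[v]

    name: Dict[str, int] = {}

    def assign(v: str, first: int) -> None:
        # Top-down phase: v takes `first`; each child gets a block of size[c] names.
        name[v] = first
        next_free = first + 1
        for c in children[v]:
            assign(c, next_free)
            next_free += size[c]

    count(root)
    assign(root, 1)
    return name


children = {"r": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}
print(preorder_names(children, "r"))
```

Each node only needs the first name of its own block and the sizes of its children's subtrees, which is what makes a distributed version possible.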


In a rooted tree, it becomes easy for the root to select uniformly an entity at random. Once unique names have been assigned in preorder to the nodes and the root knows the number n of entities, the root needs only to choose locally a number uniformly at random between 1 and n; the entity with such a name will be the selected one. At this point, the only thing that the root r still has to do is to communicate efficiently to the selected entity x the result of the selection. Actually, it is not necessary to assign unique names to the entities; in fact, it suffices that each entity knows the number of descendants of each of its children, and the entire process (from initial notification to all to final notification to x) can be performed with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units (Exercise 2.9.50).

Example: Choosing at Random from a Distributed Set. An interesting computation is the one of choosing at random an element of a set of data distributed (without replication) among the entities. The setting is that of a set D partitioned among the entities; that is, each entity x has a subset Dx ⊆ D of the data where ∪x Dx = D and, for x ≠ y, Dx ∩ Dy = ∅. Let us concentrate again on uniform choice; that is, every data item must have the same probability, 1/|D|, of being selected. How can this be achieved?

IMPORTANT. Choosing first an entity uniformly at random and then choosing an item uniformly at random in the set stored there will NOT give a uniformly random choice from the entire set (Exercise 2.9.51).

Interestingly, this problem can be solved with a technique similar to that used for selecting an entity at random and with the same cost (Exercise 2.9.52).

Application: Broadcast with Termination Detection

Convergecast can be used whenever there is a rooted spanning tree. We will now see an application of this fact.
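Before turning to that application, the counting idea behind the two random-selection examples above can be sketched in Python. This is one possible reading, not the intended solution of the exercises: `weight[v]` is 1 for entity selection or the number of items stored at v for item selection, `subtree_totals` plays the role of the convergecast, and `locate` routes a number drawn at the root down the tree. All names are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def subtree_totals(children: Dict[str, List[str]], weight: Dict[str, int],
                   root: str) -> Dict[str, int]:
    """Total weight in each subtree (what a convergecast would collect)."""
    total: Dict[str, int] = {}

    def visit(v: str) -> int:
        total[v] = weight[v] + sum(visit(c) for c in children[v])
        return total[v]

    visit(root)
    return total


def locate(children: Dict[str, List[str]], weight: Dict[str, int],
           total: Dict[str, int], root: str, k: int) -> Tuple[str, int]:
    """Return (node, local index) of the k-th unit, for 1 <= k <= total[root]."""
    v = root
    while k > weight[v]:
        k -= weight[v]                   # skip the units held by v itself
        for c in children[v]:
            if k <= total[c]:
                v = c                    # forward the request to this child
                break
            k -= total[c]                # skip the whole subtree of c
    return v, k


def select_uniformly(children, weight, root, rng: random.Random) -> Tuple[str, int]:
    total = subtree_totals(children, weight, root)
    return locate(children, weight, total, root, rng.randint(1, total[root]))


children = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
items = {"r": 0, "a": 2, "b": 1, "c": 3}
print(select_uniformly(children, items, "r", random.Random(0)))
```

Because every number between 1 and the total maps to a different unit, a uniform draw at the root gives a uniform choice among entities (all weights 1) or among data items (weights equal to the local set sizes).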
It is a “fact of life” in distributed computing that entities can terminate the execution of a protocol at different times; furthermore, when an entity terminates, it is usually unaware of the status of the other entities. This is why we differentiate between local termination (i.e., of the entity) and global termination (i.e., of the entire system). For example, with the broadcast protocol Flooding the initiator of the broadcast does not know when the broadcast is over. To ensure that the initiator of the broadcast becomes aware of when global termination occurs, we need to use a different strategy. To develop this strategy, recall that, if an entity s performs a Flood+Reply (e.g., protocol Shout) in a tree, the tree will become rooted in s: the initiator is the root; for every other node y, the neighbor x from which it receives the ﬁrst broadcasted message is its parent, and all the neighbors that send the positive reply (e.g., “YES” in Shout and Shout+) are its children. This means that convergecast can be “appended” to any Flood+Reply protocol.


Strategy Broadcast with Termination Detection:
1. The initiator s uses any Flood+Reply protocol to broadcast and construct a spanning tree T[s] of the network;
2. Starting from the leaves of T[s], the entities perform a convergecast on T.

At the end of the convergecast, s becomes aware of the global termination of the broadcast (Exercise 2.9.48).

As for the cost, to broadcast with termination detection we need just to add the cost of the convergecast to the one of the Flood+Reply protocol used. For example, if we use Shout+, the resulting protocol, which we shall call TDCast, will then use 2m + n − 1 messages. The ideal time of Shout+ is exactly r(s) + 1; the ideal time of convergecast is exactly the height of the tree T[s], that is, r(s); thus, protocol TDCast has ideal time complexity 2r(s) + 1. This means that termination detection can be added to broadcast with less than twice the cost of broadcasting alone.
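The message count can be checked with a small simulation. The Python sketch below assumes a Shout+-like behavior (a node adopts the sender of the first question Q as its parent, answers it with Yes, forwards Q to its other neighbors, and treats any later Q as an implicit negative reply) and delivers messages in FIFO order; the function name and these modeling choices are assumptions, not the protocol's actual rules.

```python
from collections import deque
from typing import Dict, List


def tdcast_message_count(adj: Dict[str, List[str]], s: str) -> int:
    """Count the messages of a flood-with-reply from s followed by a convergecast."""
    parent = {s: None}
    children: Dict[str, List[str]] = {v: [] for v in adj}
    in_transit = deque((s, y, "Q") for y in adj[s])
    messages = len(in_transit)
    while in_transit:
        sender, receiver, kind = in_transit.popleft()
        if kind == "Yes":
            children[receiver].append(sender)
        elif receiver not in parent:          # first Q received: adopt sender as parent
            parent[receiver] = sender
            sent = [(receiver, sender, "Yes")]
            sent += [(receiver, z, "Q") for z in adj[receiver] if z != sender]
            in_transit.extend(sent)
            messages += len(sent)
        # any later Q is read as an implicit "No" and needs no reply

    # Convergecast on the tree just built: one message per non-root node.
    waiting = {v: len(children[v]) for v in adj}
    ready = [v for v in adj if v != s and waiting[v] == 0]
    while ready:
        v = ready.pop()
        messages += 1
        p = parent[v]
        waiting[p] -= 1
        if p != s and waiting[p] == 0:
            ready.append(p)
    return messages


# A square a-b-c-d with the diagonal a-c: n = 4, m = 5.
square = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["a", "c"]}
print(tdcast_message_count(square, "a"))
```

On this graph the simulation sends 2m = 10 messages in the flooding phase and n − 1 = 3 in the convergecast.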

2.7 SUMMARY

2.7.1 Summary of Problems

Broadcast [Information problem] ⇒ A single entity has special information that everybody must know.
   Unique Initiator
   Flooding: Messages = Θ(m); Time = Θ(d)

Wake-Up [Information/Synchronization problem] ⇒ Some entities are awake; everybody must wake up.
   Wake-Up ≡ (Broadcast with multiple initiators)
   WFlood: Messages = Θ(m); Time = Θ(d)

Traversal [Network problem] ⇒ Starting from the initiator, each entity is visited sequentially.
   Unique Initiator
   DF-Traversal: Messages = Θ(m); Time = Θ(n)

Spanning-Tree Construction [Network problem] ⇒ Each entity identifies the subset of neighbors in the spanning tree.
   SPT with unique initiator ≡ Broadcast
   Unique Initiator: Shout: Messages = Θ(m); Time = Θ(d)
   Multiple Initiators: assume Distinct Initial Values


Election [Control problem] ⇒ One entity becomes leader, all others enter different special status.
   Distinct Initial Values

Minimum Finding [Data problem] ⇒ Each entity must know whether its initial value is minimum or not.

Center Finding [Network problem] ⇒ Each entity must know whether or not it is a center of the network.

2.7.2 Summary of Techniques

Flooding: with single initiator = broadcast; with multiple initiators = wake-up.
Flooding with Reply (Shout): with single initiator, it creates a spanning tree rooted in the initiator.
Convergecast: in rooted trees only.
Flooding with Replies plus Convergecast (TDCast): single initiator only, initiator finds out that the broadcast has globally terminated.
Saturation: in trees only.
Depth-first traversal: single initiator only.

2.8 BIBLIOGRAPHICAL NOTES Of the basic techniques, ﬂooding is the oldest one, still currently and frequently used. The more sophisticated reﬁnements of adding reply and a convergecast were discussed and employed independently by Adrian Segall [11] and Ephraim Korach, Doron Rotem and Nicola Santoro [8]. Broadcasting in a linear number of messages in unoriented hypercubes is due to Stefan Dobrev and Peter Ruzicka [6]. The use of broadcast trees was ﬁrst discussed by David Wall [12]. The depth-ﬁrst traversal protocol was ﬁrst described by Ernie Chang [3]; the ﬁrst hacking improvement is due to Baruch Awerbuch [2]; the subsequent improvements were obtained by Kadathur Lakshmanan, N. Meenakshi, and Krishnaiyan Thulasiraman [9] and independently by Israel Cidon [4]. The difﬁculty of performing a wake-up in labeled hypercubes and in complete graphs has been proved by Stefan Dobrev, Rastislav Kralovic, and Nicola Santoro [5]. The ﬁrst formal argument on the impossibility of some global computations under R (e.g., the impossibility result for spanning-tree construction with multiple initiators) is due to Dana Angluin [1]. The saturation technique is originally due to Nicola Santoro [10]; its application to center and median ﬁnding was developed by Ephraim Korach, Doron Rotem, and Nicola Santoro [8]. A decentralized solution to the ranking problem (Problem 2.9.4) was designed by Ephraim Korach, Doron Rotem, and Nicola Santoro [7]; a less efﬁcient centralized one is due to Shmuel Zaks [13].


2.9 EXERCISES, PROBLEMS, AND ANSWERS

2.9.1 Exercises

Exercise 2.9.1 Show that protocol Flooding uses exactly 2m − n + 1 messages.

Exercise 2.9.2 Design a protocol to broadcast without the restriction that the unique initiator must be the entity with the initial information. Write the new problem definition. Discuss the correctness of your protocol. Analyze its efficiency.

Exercise 2.9.3 Modify Flooding so as to broadcast under the restriction that the unique initiator must be an entity without the initial information. Write the new problem definition. Discuss the correctness of your protocol. Analyze its efficiency.

Exercise 2.9.4 We want to move the system from an initial configuration where every entity is in the same status ignorant except the one that is knowledgeable to a final configuration where every entity is in the same status. Consider this problem under the standard assumptions plus Unique Initiator. (a) Prove that, if the unique initiator is restricted to be one of the ignorant entities, this problem is the same as broadcasting (same solution, same costs). (b) Show how, if the unique initiator is restricted to be the knowledgeable entity, the problem can be solved without any communication.

Exercise 2.9.5 Design a protocol to broadcast without the Bidirectional Link restriction. Discuss its correctness. Analyze its efficiency.

Exercise 2.9.6 Prove that, in the worst case, the number of messages used by protocol WFlood is at most 2m. Show under what conditions such a bound will be achieved. Under what conditions will the protocol use only 2m − n + 1 messages?

Exercise 2.9.7 Prove that protocol WFlood correctly terminates under the standard set of restrictions BL, C, and TR.

Exercise 2.9.8 Write the protocol that implements strategy HyperFlood.

Exercise 2.9.9 Show that the subgraph Hk(x), induced by the messages sent when using HyperFlood on the k-dimensional hypercube Hk with x as the initiator, contains no cycles.

Exercise 2.9.10 Show that for every x the eccentricity of x in Hk(x) is k.

Exercise 2.9.11 Prove that the message complexity of traversal under R is at least m. (Hint: use the same technique employed in the proof of Theorem 2.1.1.)


Exercise 2.9.12 Let G be a tree. Show that, in this case, no Backedge messages will be sent in any execution of DF Traversal.

Exercise 2.9.13 Characterize the virtual ring formed by an execution of DF Traversal in a tree network. Show that the ring has 2n − 2 virtual nodes.

Exercise 2.9.14 Write the protocol DF++.

Exercise 2.9.15 Prove that protocol DF++ correctly performs a depth-first traversal.

Exercise 2.9.16 Show that, in the execution of DF++, on some back-edges there might be two “mistakes.”

Exercise 2.9.17 Determine the exact number of messages transmitted in the worst case when executing DF* in a complete graph.

Exercise 2.9.18 Prove that in protocol Shout, if an entity x is in Tree-neighbors of y, then y is in Tree-neighbors of x.

Exercise 2.9.19 Prove that in protocol Shout, if an entity sends Yes, then it is connected to the initiator by a path where on every link a Yes has been transmitted. (Hint: use induction.)

Exercise 2.9.20 Prove that the subnet constructed by protocol Shout contains no cycles.

Exercise 2.9.21 Prove that T[Flood+Reply] = T[Flooding] + 1.

Exercise 2.9.22 Write the set of rules for protocol Shout+.

Exercise 2.9.23 Determine under what conditions on the communication delays protocol Shout will construct a breadth-first spanning tree.

Exercise 2.9.24 Modify protocol Shout so that the initiator can determine when the broadcast is globally terminated. (Hint: integrate in the protocol the convergecast operation for rooted trees.)

Exercise 2.9.25 Modify protocol DF* so that every entity determines its neighbors in the df-tree it constructs.

Exercise 2.9.26 Prove that f∗ is exactly the number of leaves of the df-tree constructed by df-SPT.

Exercise 2.9.27 Prove that, in the execution of df-SPT, when the initiator becomes done, a df-tree of the network has already been constructed.


Exercise 2.9.28 Prove that, for any broadcast protocol, the graph induced by relationship “parent” is a spanning tree of the network.

Exercise 2.9.29 Prove that the bf-tree of G rooted in a center is a broadcast tree of G.

Exercise 2.9.30 Verify that, with multiple initiators, the optimized versions DF+ and DF* of protocol df-SPT will always create a spanning forest of the graph depicted in Figure 2.14.

Exercise 2.9.31 Prove that when a node becomes saturated in the execution of protocol MinF-Tree, it knows the minimum value in the network.

Exercise 2.9.32 Prove that when a node becomes saturated in the execution of protocol Funct-Tree, it knows the value of f.

Exercise 2.9.33 Design a protocol to determine if all the entities of a tree network have positive initial values. Any number of entities can independently start.

Exercise 2.9.34 Consider a tree system where each entity has a salary and a gender. Some external investigators want to know if all the entities with a salary below $50,000 are female. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.35 Consider the same tree system of Exercise 2.9.34. The investigators now want to know if there is at least one female with a salary above $50,000. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.36 Design an efficient protocol to compute the number of entities in a tree network. Any number of entities can independently start the protocol.

Exercise 2.9.37 Consider the same tree system of Exercise 2.9.34. The investigators now want to know how many female entities are in the system. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.38 Consider the following use of the M message: a leaf will include a value v = 1; an internal node will include one plus the maximum of all the received values. Prove that the saturated nodes will compute their maximum distance from all other nodes.

Exercise 2.9.39 Prove that for any link (u, v), d[u, v] = Max{d(u, y) : y ∈ T[v − u]} = 1 + Max{d(v, y) : y ∈ T[u − v]} = Max{d[v, z] : z ≠ u ∈ N(v)}.

Exercise 2.9.40 Modify protocol Eccentricities so it can solve Center, as discussed in Section 2.6.5.


Exercise 2.9.41 Median Finding. Construct an efficient plug-in so that the median nodes know that they are such.

Exercise 2.9.42 Diameter Finding. Design an efficient protocol to determine the diameter of the tree. (Hint: use Lemma 2.6.2.)

Exercise 2.9.43 Rank Finding in Tree. Consider a tree where each entity x has an initial value v(x); these values are not necessarily distinct. The rank of an entity x will be the rank of its value; that is, rank(x) = 1 + |{y ∈ V : v(y) < v(x)}|. So, whoever has the smallest value has rank 1. Design an efficient protocol to determine the rank of a unique initiator (i.e., under the additional restriction UI).

Exercise 2.9.44 Generic Rank Finding. Consider the ranking problem described in Exercise 2.9.43. Design an efficient solution protocol that is generic; that is, it works in an arbitrary connected graph.

Exercise 2.9.45 Diametral Paths. A path whose length is d is called diametral. Design an efficient protocol so that each entity can determine whether or not it lies on a diametral path of the tree.

Exercise 2.9.46 A path whose length is d is called diametral. Design an efficient plug-in so that all and only the entities on a diametral path of the tree become aware of this fact.

Exercise 2.9.47 Show that convergecast uses only 1 (one) message less than the saturation stage in general trees.

Exercise 2.9.48 Prove that, when an initiator of a TDCast protocol receives the convergecast message from all its children, the initial broadcast is globally terminated.

Exercise 2.9.49 Show how to assign efficiently a unique id to the entities in a rooted tree.

Exercise 2.9.50 Random Entity Selection () Consider the task of selecting uniformly at random an entity in a tree rooted at s. Show how to perform this task, started by the root, with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units. Prove both correctness and complexity.

Exercise 2.9.51 Show why choosing uniformly at random a site and then choosing uniformly at random an element from that site is not the same as choosing uniformly at random an element from the entire set.

Exercise 2.9.52 Random Item Selection () Consider the task of selecting uniformly at random an item from a set of data partitioned among the nodes of a tree rooted at s. Show how to perform this task, started by the root, with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units. Prove both correctness and complexity.

2.9.2 Problems

Problem 2.9.1 Develop an efficient solution to the Traversal problem without the Bidirectional Links assumption.

Problem 2.9.2 Develop an efficient solution to the Minimum Finding problem in a hypercube with a unique initiator (i.e., under the additional restriction UI). Note that the values might not be distinct.

Problem 2.9.3 Solve the Minimum Finding problem in a system where there is already a leader; that is, under restrictions R ∪ UI. Note that the values might not be distinct. Prove the correctness of your solution, and analyze its efficiency.

Problem 2.9.4 Ranking. () Consider a tree where each entity x has an initial value v(x); these values are not necessarily distinct. The rank of an entity x will be the rank of its value; that is, rank(x) = 1 + |{y ∈ V : v(y) < v(x)}|. So, whoever has the smallest value has rank 1. Design an efficient protocol to determine the rank of all entities. Prove the correctness of your protocol and analyze its complexity.

2.9.3 Answers to Exercises

Answer to Exercise 2.9.13 A node appears several times in the virtual ring; more precisely, there is an instance of node z in R for each time z has received a Token or a Finished message. Let x be the initiator; node x sends a Token to each of its neighbors sequentially and receives a Finished message from each. Every node y ≠ x receives exactly one Token (from its parent) and sends one to all its other neighbors (its children); it will also receive a Finished message from all its children and send one to its parent. In other words, every node z, including the initiator x, will appear n(z) = |N(z)| times in the virtual ring. The total number of (virtual) nodes in the virtual ring is therefore Σ_{z∈V} |N(z)| = 2m = 2(n − 1).

Answer to Exercise 2.9.16 Consider a ring network with the three nodes x, y, and z. Assume that entity x holds the Token initially.
Consider the following sequence of events that take place successively in time as a result of the execution of the DF++ protocol: x sends Visited messages to y and z, sends the Token to y, and waits for a (Visited or Return) reply from y. Assume that the link (x, z) is extremely slow. When y receives the Token from x, it sends to z a Visited message and then the Token. Assume that when z receives the Token, the Visited message from x has not arrived yet; hence z sends Visited to x followed by the Token. This is the ﬁrst mistake: Token is sent on a back-edge to x, which has already been visited.


When z finally receives the Visited message from x, it realizes the Token it sent to x was a mistake. Since it has no other unvisited neighbors, z sends a Return message back to y. Since y has no other unvisited neighbors, it will then send a Return message back to x. Assume that when x receives the Return message from y, x has not yet received either the Visited or the Return message sent by z. Hence, x considers z as an unvisited neighbor and sends the Token to z. This is the second mistake on the back-edge between x and z.

Answer to Exercise 2.9.19 Suppose some node x is not reachable from s in the graph T induced by the “parent” relationship. This means that x never sent the Yes messages; this implies that x never received the question Q. This is impossible because, since flooding is correct, every entity will receive Q; thus, no such x exists.

Answer to Exercise 2.9.20 Suppose the graph T induced by the “parent” relationship (i.e., the Yes messages) contains a directed cycle x0, x1, . . . , xk−1; that is, xi is the parent of xi+1 (operations on the indices are modulo k). This cycle cannot contain the initiator s (because it does not send any Yes). We know (Exercise 2.9.19) that in T there is a path from s to each node, including those in the cycle. This means that there will be in T a node y not in the cycle that is connected to a node xi in the cycle. This means that xi sent a Yes message to y; but since it is in the cycle, it also sent a Yes message to xi−1 (operations on the indices are modulo k). This is impossible because an entity sends no more than one Yes message.

Answer to Exercise 2.9.31 First show that if a node x sends M to neighbor y, M contains the smallest value in T[x − y]; then, since a saturated node receives by definition an M message from all neighbors, it knows the minimum value in the network. Prove that the value sent by x to y in M is the minimum value in T[x − y] by induction on the height h of T[x − y].
Trivially true if h = 1, that is, x is a leaf. Let it be true up to k ≥ 1; we will now show it is true for h = k + 1. x sends M to y because it has received a value from all its other neighbors y1, y2, . . .; since the height of T[yi − x] is less than h, then by inductive hypothesis the value sent by yi to x is the minimum value in T[yi − x]. This means that the smallest among v(x) and all the values received by x is the minimum value in T[x − y]; this is exactly what x sends to y.

Answer to Exercise 2.9.41 It is clear that if node x knows |T[y − x]| for all neighbors y, then it can compute G[y, x] and decide whether x is itself a median and, if not, determine the direction of the median. Thus, to find a median it is sufficient to modify the basic technique to supply this information to the elected node from which the median is approached. This is done by providing two counters, m1 and m2, with each M message: when a node x sends an M message to y, then m1 = g[T[y − x], y] − 1 and m2 = |T[y − x]| − 1. An active node x processes all received M messages so that, before it sends M to the last neighbor y, it knows G[T[x − z], x] and |T[z − x]| for all other neighbors z. In particular, the elected node can determine whether it is the median and, if not, can send a message toward it; a node receiving such a message will, in turn, perform the same operations until a median is located. Once again, the total number of exchanged messages is that of the Full Saturation plug-in plus d(s, med), where s is the saturated node closer to the medians, and med is the median furthermost from s.

Partial Answer to Exercise 2.9.48 By induction on the height of the rooted tree, prove that, in a TDCast protocol, when an entity x receives the convergecast message from all its children, all its descendants have locally terminated the broadcast.

Partial Answer to Exercise 2.9.49 Perform first a broadcast from the root to notify all entities of the start of the protocol, and then a convergecast to collect at each entity the number of its descendants. Afterwards use this information to assign distinct values to the entities according to a preorder traversal of the tree.

Partial Answer to Exercise 2.9.51 Show that the data items from smaller sets will be chosen with higher probability than that of the items from larger sets.

BIBLIOGRAPHY

[1] D. Angluin. Local and global properties in networks of processors. In Proc. of the 12th ACM STOC Symposium on Theory of Computing, pages 82–93, 1980.
[2] B. Awerbuch. A new distributed depth-first search algorithm. Information Processing Letters, 20:147–150, 1985.
[3] E.J.H. Chang. Echo algorithms: Depth parallel operations on general graphs. IEEE Transactions on Software Engineering, SE-8(4):391–401, July 1982.
[4] I. Cidon. Yet another distributed depth-first search algorithm. Information Processing Letters, 26:301–305, 1987.
[5] S. Dobrev, R. Kralovic, and N. Santoro. On the difficulty of waking up. In print, 2006.
[6] S. Dobrev and P. Ruzicka. Linear broadcasting and O(n log log n) election in unoriented hypercubes. In Proc. of the 4th International Colloquium on Structural Information and Communication Complexity (Sirocco’97), Ascona, July 1997. To appear.
[7] E. Korach, D. Rotem, and N. Santoro. Distributed algorithms for ranking the nodes of a network. In 13th SE Conf. on Combinatorics, Graph Theory and Computing, volume 36 of Congressus Numerantium, pages 235–246, Boca Raton, February 1982.
[8] E. Korach, D. Rotem, and N. Santoro. Distributed algorithms for finding centers and medians in networks. ACM Transactions on Programming Languages and Systems, 6(3):380–401, July 1984.
[9] K.B. Lakshmanan, N. Meenakshi, and K. Thulasiraman. A time-optimal message-efficient distributed algorithm for depth-first search. Information Processing Letters, 25:103–109, 1987.

98

BASIC PROBLEMS AND PROTOCOLS

[10] N. Santoro. Determining topology information in distributed networks. In Proc. 11th SE Conf. on Combinatorics, Graph Theory and Computing, Congressus Numeratium, pages 869–878, Boca Raton, February 1980. [11] A. Segall. Distributed network protocols. IEEE Transactions on Information Theory, IT-29(1):23–35, Jan 1983. [12] D. Wall. Mechanisms for broadcast and selective broadcast. PhD thesis, Stanford University, June 1980. [13] Shmuel Zaks. Optimal distributed algorithms for sorting and ranking. IEEE Transactions on Computers, 34:376–380, 1985.

CHAPTER 3

Election

3.1 INTRODUCTION

In a distributed environment, most applications require a single entity to act temporarily as a central controller to coordinate the execution of a particular task by the entities. In some cases, the need for a single coordinator arises from the desire to simplify the design of the solution protocol for a rather complex problem; in other cases, the presence of a single coordinator is required by the nature of the problem itself.

The problem of choosing such a coordinator from a population of autonomous symmetric entities is known as Leader Election (Elect). Formally, the task consists of moving the system from an initial configuration where all entities are in the same state (usually called available) into a final configuration where all entities are in the same state (traditionally called follower), except one, which is in a different state (traditionally called leader). There is no restriction on the number of entities that can start the computation, nor on which entity should become leader.

We can think of the Election problem as the problem of enforcing restriction Unique Initiator in a system where actually no such restriction exists: The multiple initiators would first start the execution of an Election protocol; the sole leader will then be the unique initiator for the subsequent computation.

As election provides a mechanism for breaking the symmetry among the entities in a distributed environment, it is at the base of most control and coordination processes (e.g., mutual exclusion, synchronization, concurrency control, etc.) employed in distributed systems, and it is closely related to other basic computations (e.g., minimum finding, spanning-tree construction, traversal).

3.1.1 Impossibility Result

We will start considering this problem under the standard restrictions R: Bidirectional Links, Connectivity, and Total Reliability. There is unfortunately a very strong impossibility result about election.
Theorem 3.1.1 Problem Elect is deterministically unsolvable under R.

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.


FIGURE 3.1: Electing a leader.

In other words, there is no deterministic protocol that will always correctly terminate within finite time if the only restrictions are those in R. To see why this is the case, consider a simple system composed of two entities, x and y, both initially available and with no distinct initial values; in other words, they are initially in identical states. If a solution protocol P exists, it must work under any conditions of message delays. Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let the two entities start the execution of P simultaneously. As they are in identical states, they will execute the same rule, obtain the same result, and compose and send the same message (if any); thus, they will still be in identical states. If one of them receives a message, the other will receive the same message at the same time and, by Property 1.6.2, they will perform the same computation, and so on. Their states will always be the same; hence if one becomes leader, so will the other. But this contradicts the requirement that there should be only one leader; in other words, P is not a solution protocol.

3.1.2 Additional Restrictions

The consequence of Theorem 3.1.1 is that to break symmetry, we need additional restrictions and assumptions. Some restrictions are not powerful enough. This is the case, for example, with the assumption that a spanning tree is already available (i.e., restriction Tree). In fact, the two-node network in which we know election is impossible is a tree.

To determine which restrictions, added to R, will enable us to solve Elect, we must consider the nature of the problem. The entities have an inherent behavioral symmetry: They all obey the same set of rules, and they also have an initial state symmetry (by definition of the election problem). To elect a leader means to break these symmetries; in fact, election is also called symmetry breaking. To be able to do so, from the start there must be something in the system that the entities can use, something that makes (at least one of) them different. Remember that any restriction limits the applicability of the protocol.

The most obvious restriction is Unique Initiator (UI): The unique initiator, known to be unique, becomes the leader. This is, however, “sweeping the problem under the carpet,” saying that we can elect a leader if there is already a leader and it knows about it. The problem is to elect a leader when many (possibly, all) entities are initiators; thus, without UI.
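The symmetry argument behind Theorem 3.1.1 can be replayed in a few lines. The sketch below is only an illustration under stated assumptions: the names run_in_lockstep and hopeful_rule are hypothetical, and hopeful_rule stands for an arbitrary deterministic rule, not for any protocol of the text. Two entities start in identical states and run the same rule under a synchronous schedule; whatever the rule does, their states coincide after every round.

```python
def run_in_lockstep(rule, rounds):
    """Run the same deterministic rule at two anonymous entities x and y
    joined by one link, under a synchronous schedule with unit delays."""
    state_x = state_y = "available"
    inbox_x = inbox_y = None
    history = []
    for _ in range(rounds):
        state_x, out_x = rule(state_x, inbox_x)
        state_y, out_y = rule(state_y, inbox_y)
        # What one entity sends in this round reaches the other in the next.
        inbox_x, inbox_y = out_y, out_x
        history.append((state_x, state_y))
    return history


def hopeful_rule(state, message):
    """Hypothetical election attempt: maps (state, received message)
    to (new state, message to send or None)."""
    if state == "available":
        return "candidate", "claim"
    if state == "candidate" and message == "claim":
        return "leader", None
    return state, None


history = run_in_lockstep(hopeful_rule, rounds=3)
```

With this toy rule both entities end up as leader, which is exactly the violation used in the argument; replacing hopeful_rule by any other deterministic rule still leaves the two states equal in every round.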


The restriction that is commonly used is a very powerful one, Initial Distinct Values (ID), which we have already employed to circumvent a similar impossibility result for constructing a spanning tree with multiple initiators (see Section 2.5.5). Initial distinct values are sometimes called identifiers or ids or global names and, as we will see, their presence will be sufficient to elect a leader; let id(x) denote the distinct value of x. The use of this additional assumption is so frequent that the set of restrictions IR = R ∪ {ID} is called the standard set for election.

3.1.3 Solution Strategies

How can the difference in initial values be used to break the symmetry and to elect a leader? According to the election problem specifications, it does not matter which entity becomes the leader. Using the fact that the values are distinct, a possible strategy is to choose as a leader the entity with the smallest value; in other words, an election strategy is as follows:

Strategy Elect Minimum:
1. find the smallest value;
2. elect as a leader the entity with that value.

IMPORTANT. Finding the minimum value is an important problem of its own, which we have already discussed for tree networks (Section 2.6.2). Notice that on that occasion, we found the minimum value without unique identifiers; it is the election problem that needs them.

A useful variant of this strategy is the one restricting the choice of the leader to the set of entities that initiate the protocol. That is,

Strategy Elect Minimum Initiator:
1. find the smallest value among the initiators;
2. elect as a leader the entity with that value.

IMPORTANT. Notice that any solution implementing the strategy Elect Minimum solves Min as well as Elect; this is not true of the ones implementing Elect Minimum Initiator.

Similarly, we can define the Elect Maximum and the Elect Maximum Initiator strategies.

Another strategy is to use the distinct values to construct a rooted spanning tree of the network and to elect the root as the leader. In other words, an election strategy is as follows:


Strategy Elect Root:
1. construct a rooted spanning tree;
2. elect as the leader the root of the tree.

IMPORTANT. Constructing a (rooted) spanning tree is an important problem of its own, which we have already discussed among the basic problems (Section 2.5). Recall that SPT, like Elect, is unsolvable under R.

In the rest of this chapter, we will examine how to use these strategies to solve Elect under election’s standard set of restrictions IR = R ∪ {ID}. We will do so by first examining special types of networks and then focusing on the development of topology-independent solutions.

3.2 ELECTION IN TREES

The tree is the connected graph with the “sparsest” topology: m = n − 1. We have already seen how to optimally find the smallest value using the saturation technique: protocol MinF-Tree in Section 2.6.2. Hence the strategy Elect Minimum leads to an election protocol Tree:Elect Min where the number of messages in the worst case is as follows:

M[Tree:Elect Min] = 3n + k∗ − 4 ≤ 4n − 4.

Interestingly, the strategy Elect Minimum Initiator will also have the same complexity (Exercise 3.10.1).

Consider now applying the strategy Elect Root. As the network is a tree, the only work required is to transform it into a rooted tree. It is not difficult to see how saturation can be used to solve the problem. In fact, if Full Saturation is applied, then a saturated node knows that it itself and its parent are the only saturated nodes; furthermore, as a result of the saturation stage, every nonsaturated entity has identified as its parent the neighbor closest to the saturated pair. In other words, saturation will root the tree not in a single node but in a pair of neighbors: the saturated ones.

Thus, to make the tree rooted in a single node we just need to choose one of the two saturated nodes. In other words, the “Election” among all the nodes is reduced to an “election” between the two saturated ones. This can be easily accomplished by having the saturated nodes communicate their identities and by having the node with the smallest identity become elected, while the other becomes a follower.

Thus, the Tree:Elect Root protocol will be Full Saturation with the new rules and the routine Resolve shown in Figure 3.2. The number of message transmissions for the election algorithm Tree:Elect Root will be exactly the same as the one experienced by Full Saturation with notification


SATURATED
  Receiving("Election", id∗)
  begin
    if id(x) < id∗ then
      become LEADER;
    else
      become FOLLOWER;
    endif
    send("Termination") to N(x) − {parent};
  end

PROCESSING
  Receiving("Termination")
  begin
    become FOLLOWER;
    send("Termination") to N(x) − {parent};
  end

Procedure Resolve
begin
  send("Election", id(x)) to parent;
  become SATURATED;
end

FIGURE 3.2: New rules and routine Resolve used for Tree:Elect Root.

plus two “Election” messages, that is,

M[Tree:Elect Root] = 3n + k∗ − 2 ≤ 4n − 2.

In other words, it uses two messages more than the solution obtained using the strategy Elect Minimum.

Granularity of Analysis: Bit Complexity

From the discussion above, it would appear that the strategy Elect Minimum is “better” because it uses two fewer messages than the strategy Elect Root. This assessment is indeed the only correct conclusion obtainable using the number of messages as the cost measure. Sometimes, this measure is too “coarse” and does not really allow us to see possibly important details; to get a more accurate picture, we need to analyze the costs at a “finer” level of granularity.

Let us re-examine the two strategies in terms of the number of bits. To do so, we have to distinguish between different types of messages because some contain counters and values, while others contain only a message identifier.

IMPORTANT. Messages that do not carry values but only a constant number of bits are called signals, and in most practical systems they have significantly lower communication costs than value messages.

In Elect Minimum, only the n messages in the saturation stage carry a value, while all the others are signals; hence, the total number of bits transmitted will be

B[Tree:Elect Min] = n (c + log id) + c (2n + k∗ − 2),    (3.1)


where id denotes the largest value sent in a message, and c = O(1) denotes the number of bits required to distinguish among the different messages. In Elect Root, only the “Election” message carries a node identity; thus, the total number of bits transmitted is

B[Tree:Elect Root] = 2 (c + log id) + c (3n + k∗ − 2).    (3.2)

That is, in terms of the number of bits, Elect Root is an order of magnitude better than Elect Minimum. In terms of signals and value messages, with the Elect Root strategy we have only two value messages, whereas with the Elect Minimum strategy we have n value messages.

Remember: Measuring the number of bits always gives us a “picture” of the efficiency at a more refined level of granularity. Fortunately, it is not always necessary to go to such a level.
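To get a feeling for the size of the gap, expressions (3.1) and (3.2) can be evaluated numerically. The sketch below simply transcribes the two formulas as stated; the function names and the sample parameters (1000 entities, 10 initiators, 32-bit identities, c = 4) are assumptions chosen for illustration only.

```python
import math


def bits_elect_min(n, k_star, max_id, c):
    """Expression (3.1): n value messages, the remaining messages are signals."""
    return n * (c + math.log2(max_id)) + c * (2 * n + k_star - 2)


def bits_elect_root(n, k_star, max_id, c):
    """Expression (3.2): two value messages, the remaining messages are signals."""
    return 2 * (c + math.log2(max_id)) + c * (3 * n + k_star - 2)


# Hypothetical setting: 1000 entities, 10 initiators, ids below 2**32, c = 4.
b_min = bits_elect_min(1000, 10, 2**32, 4)
b_root = bits_elect_root(1000, 10, 2**32, 4)
print(b_min, b_root, b_min / b_root)
```

In this setting Elect Root transmits roughly a quarter of the bits of Elect Minimum; the difference between the two expressions, (n − 2) log id − 2c, grows with both the ring of values and the number of entities.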

3.3 ELECTION IN RINGS

We will now consider a network topology that plays a very important role in distributed computing: the ring, sometimes called loop network. A ring consists of a single cycle of length n. In a ring, each entity has exactly two neighbors, (whose associated ports are) traditionally called left and right (see Figure 3.3).

IMPORTANT. Note that the labeling might, however, be globally inconsistent, that is, “right” might not have the same meaning for all entities. We will return to this point later.

FIGURE 3.3: A ring network.


After trees, rings are the networks with the sparsest topology: m = n; however, unlike trees, rings have a complete structural symmetry (i.e., all nodes look the same). We will denote the ring by R = (x0, x1, . . . , xn−1).

Let us consider the problem of electing a leader in a ring R, under the standard set of restrictions for election, IR = {Bidirectional Links, Connectivity, Total Reliability, Initial Distinct Values}, as well as the knowledge that the network is a ring (Ring). Denote by id(x) the unique value associated to x.

Because of its structure, in a ring we will use almost exclusively the approach of minimum finding as a tool for leader election. In fact we will consider both the Elect Minimum and the Elect Minimum Initiator approaches. Clearly the first solves both Min and Elect, while the latter solves only Elect.

NOTE. Every protocol that elects a leader in a ring can be made to find the minimum value (if it has not already been determined) with n additional messages and n additional time units (Exercise 3.10.2). Furthermore, in the worst case, the two approaches coincide: All entities might be initiators.

Let us now examine how minimum finding and election can be efficiently performed in a ring. As in a ring each entity has only two neighbors, for brevity we will use the notation other to indicate N(x) − {sender} at an entity x.

3.3.1 All the Way

The first solution we will use is rather straightforward: When an entity starts, it will choose one of its two neighbors and send to it an “Election” message containing its id; an entity receiving the id of somebody else will send its id (if it has not already done so) and forward the received message along the ring (i.e., send it to its other neighbor), keeping track of the smallest id seen so far (including its own).

This process can be visualized as follows: Each entity originates a message (containing its id), and this message travels “all the way” along the ring (forwarded by the other entities) (see Figure 3.4). Hence, the name All the Way will be used for the resulting protocol.

Each entity will eventually see the id of everybody else (finite communication delays and total reliability ensure that), including the minimum value; it will, thus, be able to determine whether or not it is the (unique) minimum and, thus, the leader. When will this happen? In other words,

Question. When will an entity terminate its execution?

Entities only forward messages carrying values other than their own: Once the message with id(x) arrives at x, it is no longer forwarded. Thus, each value will travel “All the Way” along the ring only once. So, the communication activities will eventually terminate. But how does an entity know that the communication activities


FIGURE 3.4: All the Way: Every id travels along the ring.

have terminated, that no more messages will be arriving, and, thus, the smallest value seen so far is really the minimum id?

Consider a “reasonable” but unfortunately incorrect answer:

An entity knows that it has seen all values once it receives its value back.

The “reason” is that the message with its own id has to travel longer along the ring to reach x than those originated by other entities; thus, these other messages will be received first. In other words, reception of its own message can be used to detect termination.

This reasoning is incorrect because it uses the (hidden) additional assumption that the system has first in first out (FIFO) communication channels, that is, the messages are delivered in the order in which they are sent. This restriction, called Message Ordering, is not a part of election’s standard set; few systems actually have it built in, and the costs of offering it can be formidable.

So, whatever the answer, it must not assume FIFO channels. With this proviso, a “reasonable” but unfortunately still incorrect answer is the following:

An entity counts how many different values it receives; when the counter is equal to n, it knows it can terminate.


PROTOCOL All the Way.

States: S = {ASLEEP, AWAKE, FOLLOWER, LEADER};
  SINIT = {ASLEEP};
  STERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
  Spontaneously
  begin
    INITIALIZE;
    become AWAKE;
  end

  Receiving("Election", value∗, counter∗)
  begin
    INITIALIZE;
    send("Election", value∗, counter∗ + 1) to other;
    min:= MIN{min, value∗};
    count:= count + 1;
    become AWAKE;
  end

AWAKE
  Receiving("Election", value∗, counter∗)
  begin
    if value∗ ≠ id(x) then
      send("Election", value∗, counter∗ + 1) to other;
      min:= MIN{min, value∗};
      count:= count + 1;
      if known then CHECK endif;
    else
      ringsize:= counter∗;
      count:= count + 1;
      known:= true;
      CHECK;
    endif
  end

FIGURE 3.5: Protocol All the Way.

The problem is that this answer assumes that the entity knows n, but a priori knowledge of the ring size is not a part of the standard restrictions for election. So it cannot be used.

It is indeed strange that the termination should be difficult for such a simple protocol in such a clear setting. Fortunately, the last answer, although incorrect, provides us with the way out. In fact, although n is not known a priori, it can be computed. This is easily accomplished by having a counter in the Election message, initialized to 1 and incremented by each entity forwarding it; when an entity receives its id back, the value of the counter will be n.

Summarizing, we will use a counter at each entity, to keep track of how many different ids are received, and a counter in each message, so that each entity can determine n. The protocol is shown in Figures 3.5 and 3.6.

The message originated by each entity will travel along the ring exactly once. Thus, there will be exactly n^2 messages in total, each carrying a counter and a value,


Procedure INITIALIZE
begin
  count:= 0;
  size:= 1;
  known:= false;
  send("Election", id(x), size) to right;
  min:= id(x);
end

Procedure CHECK
begin
  if count = ringsize then
    if min = id(x) then
      become LEADER;
    else
      become FOLLOWER;
    endif
  endif
end

FIGURE 3.6: Procedures of protocol All the Way.

for a total of n^2 log(id + n) bits. The time costs will be at most 2n (Exercise 3.10.3). Summarizing,

M[AlltheWay] = n^2    (3.3)

T[AlltheWay] ≤ 2n − 1.    (3.4)
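The termination mechanism can be exercised with a small simulation. The sketch below is one possible rendering, not the protocol text itself: the function name all_the_way and its data structures are hypothetical, messages are delivered in random order (so no FIFO assumption is made), a single entity starts spontaneously while the others are woken by the first message they receive, and an entity counts its own id when it returns so that the count of distinct ids reaches the ring size n.

```python
import random


def all_the_way(ids, seed=0):
    """Simulate All the Way on a ring where entity i sends to (i + 1) % n."""
    n = len(ids)
    rng = random.Random(seed)
    state = ["ASLEEP"] * n
    count = [0] * n            # distinct ids received so far
    smallest = [None] * n      # smallest id seen so far
    ringsize = [None] * n      # learned when the entity's own id returns
    in_transit = []            # pending deliveries: (destination, value, counter)
    sent = 0

    def send(src, value, counter):
        nonlocal sent
        sent += 1
        in_transit.append(((src + 1) % n, value, counter))

    def initialize(x):
        smallest[x] = ids[x]
        send(x, ids[x], 1)
        state[x] = "AWAKE"

    def check(x):
        if count[x] == ringsize[x]:
            state[x] = "LEADER" if smallest[x] == ids[x] else "FOLLOWER"

    initialize(rng.randrange(n))  # one spontaneous initiator
    while in_transit:
        # Random choice of the next delivery: channels need not be FIFO.
        x, value, counter = in_transit.pop(rng.randrange(len(in_transit)))
        if state[x] == "ASLEEP":
            initialize(x)
        count[x] += 1
        if value != ids[x]:
            send(x, value, counter + 1)
            smallest[x] = min(smallest[x], value)
            if ringsize[x] is not None:
                check(x)
        else:
            ringsize[x] = counter  # the counter has been incremented n - 1 times
            check(x)
    return state, sent


state, sent = all_the_way([5, 4, 22, 13, 2, 17])
print(state, sent)
```

Whatever the delivery order, every id is transmitted exactly n times, so the run on the six ids above uses 36 messages and elects the entity holding id 2.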

The solution protocol we have just designed is very expensive in terms of communication costs (in a network with 100 nodes it would cause 10,000 message transmissions).

The protocol can obviously be modified so as to follow strategy Elect Minimum Initiator, finding the smallest value only among the initiators. In this case, those entities that do not initiate will not originate a message but just forward the others’. In this way, we would have fewer messages whenever there are fewer initiators.

In the modification we must be careful. In fact, in protocol All the Way, we were using an entity’s own message to determine n so as to be able to determine local termination. Now some entities will not have this information. This means that termination is again a problem. Fortunately, this problem has a simple solution requiring only n additional messages and time (Exercise 3.10.4). Summarizing, the costs of the modified protocol, All the Way:Minit, are as follows:

M[AlltheWay:Minit] = nk∗ + n    (3.5)

T[AlltheWay:Minit] ≤ 3n − 1    (3.6)

The modified protocol All the Way:Minit will in general use fewer messages than the original one. In fact, if only a constant number of entities initiate, it will use only O(n) messages, which is excellent. By contrast, if every entity is an initiator, this protocol uses n messages more than the original one.

IMPORTANT. Notice that All the Way (in its original or modified version) can be used also in unidirectional rings with the same costs. In other words, it does not require the Bidirectional Links restriction. We will return to this point later.

3.3.2 As Far As It Can

To design an improved protocol, let us determine the drawback of the one we already have: All the Way. In this protocol, each message travels all along the ring.

Consider the situation (shown in Figure 3.7) of a message containing a large id, say 22, arriving at an entity x with a smaller id, say 4. In the existing protocol, x will forward this message, even though x knows that 22 is not the smallest value. But our overall strategy is to determine the smallest id among all entities; if an entity determines that an id is not the minimum, there is no need whatsoever for the message containing such an id to continue traveling along the ring.

We will thus modify the original protocol All the Way so that an entity will only forward Election messages carrying an id smaller than the smallest seen so far by

FIGURE 3.7: Message with a larger id does not need to be forwarded.

that entity. In other words, an entity will become an insurmountable obstacle for all messages with a larger id, “terminating” them.

Let us examine what happens with this simple modification. Each entity will originate a message (containing its id) that travels along the ring “as far as it can”: until it returns to its originator or arrives at a node with a smaller id. Hence the name AsFar (As It Can) will be used for the resulting protocol.

Question. When will an entity terminate its execution?

The message with the smallest id will always be forwarded by the other entities; thus, it will travel all along the ring, returning to its originator. A message containing any other id will instead be unable to return to its originator because it will find an entity with a smaller id (and thus be terminated) along the way. In other words, only the message with the smallest id will return to its originator. This fact provides us with a termination detection mechanism.

If an entity receives a message with its own id, it knows that its id is the minimum, that is, it is the leader; the other entities have all seen that message pass by (they forwarded it), but they still do not know that there will be no smaller ids to come by. Thus, to ensure their termination, the newly elected leader must notify them by sending an additional message along the ring.

Message Cost

This protocol will definitely have fewer messages than the previous one. The exact number depends on several factors. Consider the cost caused by the Election message originated by x. This message will travel along the ring until it finds a smaller id (or completes the tour). Thus, the cost of its travel depends on how the ids are allocated on the ring. Also notice that what matters is whether an id is smaller or not than another and not their actual value. In other words, what is important is the rank of the ids and how those are situated on the ring. Denote by #i the id whose rank is i.
Worst Case

Let us first consider the worst possible case. Id #1 will always travel all along the ring, costing n messages. Id #2 will be stopped only by id #1; so its cost in the worst case is n − 1, achievable if id #2 is located immediately after id #1 in the direction it travels. In general, id #(i + 1) will be stopped by any of those with smaller rank, and, thus, it will cost at most n − i messages; this will happen if all those entities are next to each other, and id #(i + 1) is located immediately after them in the direction it will travel. In fact, all the worst cases for each of the ids are simultaneously achieved when the ids are arranged in a (circular) order according to their rank and all messages are sent in the “increasing” direction (see Figure 3.9). In this case, including also the n messages required for the final notification, the total cost will be

$$M[\mathit{AsFar}] = n + \sum_{i=1}^{n} i = \frac{n\,(n+3)}{2}. \tag{3.7}$$

PROTOCOL AsFar.

States: S = {ASLEEP, AWAKE, FOLLOWER, LEADER};
  SINIT = {ASLEEP};
  STERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
  Spontaneously
  begin
    INITIALIZE;
    become AWAKE;
  end

  Receiving("Election", value)
  begin
    INITIALIZE;
    if value < min then
      send("Election", value) to other;
      min:= value;
    endif
    become AWAKE;
  end

AWAKE
  Receiving("Election", value)
  begin
    if value < min then
      send("Election", value) to other;
      min:= value;
    else
      if value = min then NOTIFY endif;
    endif
  end

  Receiving(Notify)
  begin
    send(Notify) to other;
    become FOLLOWER;
  end

where the procedures Initialize and Notify are as follows:

Procedure INITIALIZE
begin
  send("Election", id(x)) to right;
  min:= id(x);
end

Procedure NOTIFY
begin
  send(Notify) to right;
  become LEADER;
end

FIGURE 3.8: Protocol AsFar.
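The worst-case count (3.7) can be checked by running the protocol on the arrangement described above. The following is a minimal sketch under stated assumptions: the function name as_far and its bookkeeping are hypothetical, the ring is oriented (entity i always sends to entity (i + 1) mod n), and a single global queue delivers messages in the order in which they were sent.

```python
from collections import deque


def as_far(ids, initiators=None):
    """Simulate AsFar on an oriented ring; returns (final states, messages sent)."""
    n = len(ids)
    state = ["ASLEEP"] * n
    smallest = [None] * n
    queue = deque()            # pending deliveries: (destination, kind, value)
    sent = 0

    def send(src, kind, value=None):
        nonlocal sent
        sent += 1
        queue.append(((src + 1) % n, kind, value))

    def initialize(x):
        smallest[x] = ids[x]
        send(x, "Election", ids[x])
        state[x] = "AWAKE"

    # By default every entity starts spontaneously.
    for x in (range(n) if initiators is None else initiators):
        initialize(x)

    while queue:
        x, kind, value = queue.popleft()
        if kind == "Notify":
            if state[x] != "LEADER":       # the leader absorbs its own notification
                state[x] = "FOLLOWER"
                send(x, "Notify")
            continue
        if state[x] == "ASLEEP":
            initialize(x)                  # woken up by an Election message
        if value < smallest[x]:
            smallest[x] = value
            send(x, "Election", value)
        elif value == smallest[x]:         # only an entity's own id can match
            state[x] = "LEADER"
            send(x, "Notify")
    return state, sent


n = 14
state, sent = as_far(list(range(1, n + 1)))   # ids increasing in the direction of travel
print(sent, n * (n + 3) // 2)
```

With the ids increasing in the direction of travel, as in the worst-case setting, the run uses n(n + 3)/2 messages; reversing the order makes every id except the smallest stop after a single step, for 3n − 1 messages in total.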



FIGURE 3.9: Worst case setting for protocol AsFar.

That is, we will cut the number of messages at least in half. From a theoretical point of view, the improvement is not significant; from a practical point of view, this is already a reasonable achievement. However, we have so far analyzed only the worst case. In general, the improvement will be much more significant. To see precisely how, we need to perform a more detailed analysis of the protocol’s performance.

IMPORTANT. Notice that AsFar can be used in unidirectional rings. In other words, it does not require the Bidirectional Links restriction. We will return to this point later.

The worst case gives us an indication of how “bad” things could get when the conditions are really bad. But how likely are such conditions to occur? What costs can we generally expect? To find out, we need to study the average case and determine the mean and the variance of the cost of the protocol.

Average Case: Oriented Ring

We will first consider the case when the ring is oriented, that is, “right” means the same to all entities. In this case, all messages will travel in only one direction, say clockwise.

IMPORTANT. Because of the unique nature of the ring network, this case coincides with the execution of the protocol in a unidirectional ring. Thus, the results we will obtain will hold for those rings.


To determine the average case behavior, we consider all possible arrangements of the ranks 1, . . . , n in the ring as equally likely. Given a set of size a, we denote by C(a, b) the number of subsets of size b that can be formed from it.

Consider the id #i with rank i; it will travel clockwise exactly k steps if and only if the ids of its k − 1 clockwise neighbors are larger than it (and thus will forward it), while the id of its kth clockwise neighbor is smaller (and thus will terminate it). There are n − i ids larger than id #i from which to choose those k − 1 larger clockwise neighbors, and there are i − 1 ids smaller than id #i from which to choose the kth clockwise neighbor. In other words, the number of situations where id #i will travel clockwise exactly k steps is C(n − i, k − 1)C(i − 1, 1), out of the total number of C(n − 1, k − 1)C(n − k, 1) possible situations. Thus, the probability P(i, k) that id #i will travel clockwise exactly k steps is

$$P(i,k) = \frac{C(n-i,\,k-1)\,C(i-1,\,1)}{C(n-1,\,k-1)\,C(n-k,\,1)}. \tag{3.8}$$

The smallest id, #1, will travel the full length n of the ring. The id #i, i > 1, will travel less; the expected distance will be

$$E_i = \sum_{k=1}^{n-1} k\,P(i,k). \tag{3.9}$$

Therefore, the overall expected number of message transmissions is

$$E = n + \sum_{i=2}^{n}\sum_{k=1}^{n-1} k\,P(i,k) = n + \sum_{k=1}^{n-1} \frac{n}{k+1} = n H_n, \tag{3.10}$$

where $H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$ is the nth Harmonic number. To obtain a closed formula, we use the fact that the function f(x) = 1/x is continuous, positive, and decreasing, so that the sum can be compared with the integral $\int_1^n \frac{1}{x}\,dx = \ln n$; this gives $\ln n \le H_n \le 1 + \ln n$. Hence, H_n = ln n + O(1) ≈ .69 log n + O(1); thus

Theorem 3.3.1 In oriented and in unidirectional rings, protocol AsFar will cost nH_n ≈ .69n log n + O(n) messages on average.

This is indeed great news: On average, the message cost is an order of magnitude less than that in the worst case. For n = 1024, this means that on average we have about 7066 messages instead of 525,824, which is a considerable difference.

If we use the strategy of electing the Minimum Initiator instead, we obtain the same bound but as a function of the number k∗ of initiators:


Theorem 3.3.2 In oriented and in unidirectional rings, protocol AsFar-Minit will cost nH_{k∗} ≈ .69n log k∗ messages on average.

Average Case: Unoriented Ring

Let us now consider what will happen on average in the general case, when the ring is unoriented. As before, we consider all possible arrangements of the ranks 1, . . . , n of the values in the ring as equally likely. The fact that the ring is not oriented means that when two entities send a message to their “right” neighbors, they might send it in different directions.

Let us assume that at each entity the probability that “right” coincides with the clockwise direction is 1/2. Alternatively, assume that an entity, as its first step in the protocol, flips a fair coin (i.e., probability 1/2) to decide the direction it will use to send its value. We shall call the resulting probabilistic protocol ProbAsFar.

Theorem 3.3.3 In unoriented rings, protocol ProbAsFar will cost $\frac{\sqrt{2}}{2}\, n H_n \approx .49\, n \log n$ messages on average.

A similar bound holds if we use the strategy of electing the Minimum Initiator:

Theorem 3.3.4 In unoriented rings, protocol ProbAsFar-Minit will cost $\frac{\sqrt{2}}{2}\, n H_{k^*} \approx .49\, n \log k^*$ messages on average.
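Both averages can be explored empirically on small rings. The sketch below is an illustration under several assumptions, and all function names in it are hypothetical. For the oriented case it counts only Election messages (no notification) and assumes each id is stopped exactly by the first smaller id it meets, then averages over all arrangements to compare with nH_n. For the unoriented case it samples synchronous runs of ProbAsFar in which all entities start simultaneously and messages arriving in the same round are processed in an arbitrary fixed order.

```python
from fractions import Fraction
from itertools import permutations
import random


def oriented_cost(ids):
    """Election messages when every id travels clockwise up to the first smaller id."""
    n = len(ids)
    total = 0
    for start, value in enumerate(ids):
        steps = 1
        while steps < n and ids[(start + steps) % n] > value:
            steps += 1
        total += steps         # the minimum makes the full tour of n steps
    return total


def exact_oriented_average(n):
    """Average of oriented_cost over all n! arrangements, as an exact fraction."""
    costs = [oriented_cost(p) for p in permutations(range(1, n + 1))]
    return Fraction(sum(costs), len(costs))


def n_times_harmonic(n):
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


def prob_as_far_cost(ids, rng):
    """Election messages of one synchronous run with coin-flipped directions."""
    n = len(ids)
    smallest = list(ids)       # smallest id seen so far by each entity
    in_flight = []             # (receiver, direction, value)
    for x in range(n):
        d = rng.choice((1, -1))
        in_flight.append(((x + d) % n, d, ids[x]))
    sent = n
    while in_flight:
        forwarded = []
        for x, d, value in in_flight:
            if value < smallest[x]:
                smallest[x] = value
                forwarded.append(((x + d) % n, d, value))
                sent += 1
        in_flight = forwarded
    return sent


def sampled_average(n, trials, seed=1):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        ids = rng.sample(range(1, n + 1), n)   # a random arrangement of the ranks
        total += prob_as_far_cost(ids, rng)
    return total / trials


for n in range(2, 7):
    print(n, exact_oriented_average(n), n_times_harmonic(n))
print(sampled_average(64, 2000), float(n_times_harmonic(64)))
```

The first loop should print matching pairs, since the exhaustive average equals nH_n exactly. The last line only gives a sampled estimate for ProbAsFar next to the oriented value; one would expect the former to come out lower, in line with Theorem 3.3.3, but a simulation of this kind is evidence, not a proof.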

What is very interesting about the bound expressed by Theorem 3.3.3 is that it is better (i.e., smaller) than the one expressed by Theorem 3.3.1. The difference between the two bounds is restricted to the constant and is rather limited. In numerical terms, the difference is not outstanding: 5018 instead of 7066 messages on average when n = 1024.

In practical terms, from the algorithm design point of view, it indicates that we should try to have the entities send their initial message in different directions (as in the probabilistic protocol) and not all in the same one (as in the oriented case). To simulate the initial “random” direction, different means can be used. For example, each entity x can choose (its own) “right” if id(x) is even, (its own) “left” otherwise.

This result also has a theoretical relevance that will become apparent later, when we will discuss lower bounds and will have a closer look at the nature of the difference between oriented and unoriented rings.

Time Costs

The time costs are the same as the ones of All the Way plus an additional n − 1 for the notification. This can, however, be halved by exploiting the fact that the links are bidirectional and by broadcasting the notification; this will require an extra message but halve the time.

Summary

The main drawback of protocol AsFar is that there still exists the possibility that a very large number of messages (O(n^2)) will be exchanged. As we have seen, on average, the use of the protocol will cost only O(n log n) messages. There


is, however, no guarantee that this will happen the next time the protocol will be used. To give such a guarantee, a protocol must have a O(n log n) worst case complexity.

3.3.3 Controlled Distance

We will now design a protocol that has a guaranteed O(n log n) message performance. To achieve this goal, we must first of all determine what causes the previous protocol to use $O(n^2)$ messages and then find ways around it. The first thing to observe is that in AsFar (as well as in All the Way), an entity makes only one attempt to become leader and does so by originating a message containing its id. Next observe that, once this message has been created and sent, the entity no longer has any control over it: In All the Way the message will travel all along the ring; in AsFar it will be stopped if it finds a smaller id.

Consider now the situation that causes the worst case for protocol AsFar: this is when the ids are arranged in an increasing order along the ring, and all entities identify “right” with the clockwise direction (see Figure 3.9). The entity x with id 2 will originate a message that will cause n − 2 transmissions. When x receives the message containing id 1, x finds out that its own value is not the smallest, and thus its message is destined to be wasted. However, x has no means to stop it as it no longer has any control over that message.

Let us take these observations into account to design a more efficient protocol. The key design goal will be to make an entity retain some control over the message it originates. We will use several ideas to achieve this:

1. limited distance: The entity will impose a limit on the distance its message will travel; in this way, the message with id 2 will not travel “as far as it can” (i.e., at distance n − 2) but only up to some predefined length.

2. return (or feedback) messages: If, during this limited travel, the message is not terminated by an entity with smaller id, it will return back to its originator to get authorization for further travel; in this way, if the entity with id 2 has seen id 1, it will abort any further travel of its own message.

Summarizing, an entity x will originate a message with its own id, and this message will travel until it is terminated or it reaches a certain distance dis; if it is not terminated, the message returns to the entity. When it arrives, x knows that on this side of the ring, there are no smaller ids within the traveled distance dis. The entity must now decide whether to allow its message to travel a further distance; it will do so only if it knows for sure that there are no smaller ids within distance dis on the other side of the ring as well. This can be achieved as follows:

3. check both sides: The entity will send a message in both directions; only if they both return, they will be allowed to travel a further distance.

FIGURE 3.10: Controlled distances: A message travels no more than dis(i); if it is not discarded, a feedback is sent back to the originator. A candidate that receives a feedback from both sides starts the next stage.

As a consequence, instead of a single global attempt at leadership, an entity will go through several attempts, which we shall call Electoral Stages: An entity enters the

next stage only if it passes the current one (i.e., both messages return) (see Fig. 3.10). If an entity is defeated in an electoral stage (i.e., at least one of its messages does not return), it still will have to continue its participation in the algorithm forwarding the messages of those entities that are still undefeated. Although the protocol is almost all outlined, some fundamental issues are still unresolved. In particular, the fact that we now have several stages can have strange consequences in the execution. IMPORTANT. Because of variations in communication delays, it is possible that at the same time instant, entities in different parts of the ring are in different electoral stages. Furthermore, as we are only using the standard restrictions for elections, messages can be delivered out of order; thus, it might be possible that messages from a higher stage will arrive at an entity before the ones from the current one. We said that an entity is defeated if it does not receive one of its messages back. Consider now an entity x; it has sent its two messages and it is now waiting to know the outcome. Let us say that one of its messages has returned but the other has not yet. It is possible that the message is coming very slowly (e.g., experiencing long transmission delays) or that it is not coming at all (i.e., it found a smaller id on the way). How can x know ? How long will x have to wait before taking a decision (a decision must be taken within ﬁnite time)? More speciﬁcally, what will x do if, in the meanwhile, it receives a message from a higher stage ? The answer to all these


questions is fortunately simple:

4. the smallest id wins: If, at any time, a candidate entity receives a message with a smaller id, it will become defeated, regardless of the stage number.

Notice that this creates a new situation: A message returns to its originator and finds it defeated; in this case, the message will be terminated.

The final issue we need to address is termination. The limit to the travel distance for a message in a given stage will depend on the stage itself; let $\mathit{dis}_i$ denote the limit in stage i. Clearly, these distances must be monotonically increasing, that is, $\mathit{dis}_i > \mathit{dis}_{i-1}$. The messages from an entity whose id is not the minimum will sooner or later encounter a smaller id in their travel and will not return to their originator. Consider now the entity s with the smallest id. In each stage, both of its messages will travel the full allocated distance (as no entity can terminate them) and return, making s enter the next stage. This process will continue until $\mathit{dis}_i \geq n$; at this time, each message will complete a full tour of the ring reaching s from the other side. When this happens, s will know that it has the smallest value and, thus, it is the leader. It will then start a notification process so that all the other entities can enter a terminal state.
A synthetic description of the protocol will thus be as follows:

• in each electoral stage there are some candidates;
• each candidate sends a message in both directions carrying its own id (as well as the stage number);
• a message travels until it encounters a smaller id or it reaches a certain distance (whose value depends on the stage);
• if a message does not encounter a smaller id, it will return back to its originator;
• a candidate that receives both of its own messages back survives this stage and starts the next one;

with three meta rules:

• if a candidate receives its message from the opposite side it sent to, it becomes the leader and notifies all the other entities of termination;
• if a candidate receives a message with a smaller id, it becomes defeated, regardless of the stage number;
• a defeated entity forwards the messages originating from the other entities; if the message is notification of termination, it will terminate.

The fully specified protocol Control is shown in Figures 3.11 and 3.12, where dis is a monotonically increasing function.

Correctness

The correctness of the algorithm follows from the dynamics of the rules: The messages containing the smallest id will always travel all the allocated


PROTOCOL Control.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Forth", id*, stage*, limit*)
    begin
        if id* < id(x) then
            PROCESS-MESSAGE;
            become DEFEATED
        else
            INITIALIZE;
            become CANDIDATE;
        endif
    end

CANDIDATE
    Receiving("Forth", id*, stage*, limit*)
    begin
        if id* < id(x) then
            PROCESS-MESSAGE;
            become DEFEATED
        else
            if id* = id(x) then NOTIFY endif;
        endif
    end

    Receiving("Back", id*)
    begin
        if id* = id(x) then CHECK endif;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

DEFEATED
    Receiving(*)
    begin
        send(*) to other;
        if * = Notify then become FOLLOWER endif;
    end

FIGURE 3.11: Protocol Control.


Procedure INITIALIZE
begin
    stage := 1;
    limit := dis(stage);
    count := 0;
    send("Forth", id(x), stage, limit) to N(x);
end

Procedure PROCESS-MESSAGE
begin
    limit* := limit* - 1;
    if limit* = 0 then
        send("Back", id*, stage*) to sender;
    else
        send("Forth", id*, stage*, limit*) to other;
    endif
end

Procedure CHECK
begin
    count := count + 1;
    if count = 2 then
        count := 0;
        stage := stage + 1;
        limit := dis(stage);
        send("Forth", id(x), stage, limit) to N(x);
    endif
end

Procedure NOTIFY
begin
    send(Notify) to right;
    become LEADER;
end

FIGURE 3.12: Procedures used by protocol Control.

distance, and every entity still candidate they encounter will be transformed in defeated; the distance is monotonically increasing in the number of stages; hence, eventually, the distance will be at least n. When this happens, the messages with the smallest value will travel all along the ring; as a result, their originator becomes leader and all the others are already defeated. Costs The costs of the algorithm depend totally on the choice of the function dis used to determine the maximum distance a “Forth” message can travel in a stage. Messages If we examine the execution of the protocol at some global time t, because communication delays are unpredictable, we can ﬁnd not only that entities in different parts of the ring are in different states (which is expected) but also that entities in the candidate state are in different stages. Moreover, because there is no Message Ordering, messages from high stages (the “future”) might overtake messages from lower stages and arrive at an entity still in a lower stage (the “past”). Still, we can visualize the execution as proceeding in logical stages; it is just that different entities might be executing the same stage at different times.


Focus on stage i > 1 and consider the entities that will start this stage; these $n_i$ entities are those that survived stage i − 1. To survive stage i − 1, the id of an entity x must be smaller than the ids of its neighbors at distance up to dis(i − 1) on each side of the ring. Thus, within any group of dis(i − 1) + 1 consecutive entities, at most one can survive stage i − 1 and start stage i. In other words,

$$n_i \leq \frac{n}{\mathit{dis}(i-1) + 1}. \tag{3.11}$$

An entity starting stage i will send “Forth” messages in both directions; each message will travel at most dis(i), for a total of $2 n_i\, \mathit{dis}(i)$ message transmissions. Let us examine now the “Back” messages. Each entity that survives this stage will receive such a message from both sides; as $n_{i+1}$ entities survive this stage, this gives an additional $2 n_{i+1}\, \mathit{dis}(i)$ messages. Each entity that started but did not survive stage i will receive either no or at most one “Back” message, causing a cost of at most dis(i); as there are $n_i - n_{i+1}$ such entities, they will cost no more than an additional $(n_i - n_{i+1})\, \mathit{dis}(i)$ messages in total. So, in total, the transmissions for “Back” messages are at most $2 n_{i+1}\, \mathit{dis}(i) + (n_i - n_{i+1})\, \mathit{dis}(i)$. Summarizing, the total number of messages sent in stage i > 1 will be no more than

$$2 n_i\, \mathit{dis}(i) + 2 n_{i+1}\, \mathit{dis}(i) + (n_i - n_{i+1})\, \mathit{dis}(i) = (3 n_i + n_{i+1})\, \mathit{dis}(i) \leq \left( \frac{3n}{\mathit{dis}(i-1)+1} + \frac{n}{\mathit{dis}(i)+1} \right) \mathit{dis}(i) < n \left( 3\, \frac{\mathit{dis}(i)}{\mathit{dis}(i-1)} + 1 \right).$$

The first stage is a bit different, as every entity starts; the $n_2$ entities that survive this stage will have caused the messages carrying their id to travel to distance dis(1) and back on both sides, for a total of $4 n_2\, \mathit{dis}(1)$ messages. The $n - n_2$ entities that will not survive will cause at most three messages each (two “Forth” and one “Back”) to travel distance dis(1), for a total of $3 (n_1 - n_2)\, \mathit{dis}(1)$ messages, where $n_1 = n$. Hence the first stage will cost no more than

$$(3n + n_2)\, \mathit{dis}(1) \leq \left( 3n + \frac{n}{\mathit{dis}(1)+1} \right) \mathit{dis}(1) < n\, \bigl( 3\, \mathit{dis}(1) + 1 \bigr).$$

To determine the total number of messages, we then need to know the total number k of stages. We know that a leader is elected as soon as the message with the smallest value makes a complete tour of the ring, that is, as soon as dis(i) is greater or equal to n. In other words, k is the smallest integer such that dis(k) ≥ n; such an integer is called the pseudo-inverse of n and denoted by $\mathit{dis}^{-1}(n)$. So, the total number of messages used by protocol Control will be at most

$$M[\text{Control}] \leq n \sum_{i=1}^{\mathit{dis}^{-1}(n)} \left( 3\, \frac{\mathit{dis}(i)}{\mathit{dis}(i-1)} + 1 \right) + n, \tag{3.12}$$

where dis(0) = 1 and the last n messages are those for the final notification.
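The bound of Expression 3.12 can be compared with a small simulation. The Python sketch below is an idealized model only, not the protocol of Figures 3.11 and 3.12: it assumes that all entities start simultaneously, that stages proceed in lockstep, that ids are distinct, and that a “Forth” message is stopped by the first entity with a smaller id, as in the cost analysis. The names `control_election` and `control_bound`, and the default doubling function, are hypothetical choices made for illustration.

```python
def control_election(ids, dis=lambda i: 2 ** (i - 1)):
    """Lockstep model of protocol Control on a ring of distinct ids.

    ids[p] is the id of the entity at ring position p.
    Returns (leader_id, number_of_message_transmissions).
    """
    n = len(ids)
    candidates = list(range(n))
    messages, stage = 0, 1
    while True:
        d = dis(stage)
        survivors, leader = [], None
        for p in candidates:
            returned = 0
            for step in (1, -1):              # send in both directions
                q, hops, outcome = p, 0, "limit"
                while hops < d:
                    q = (q + step) % n
                    hops += 1
                    if q == p:                # full tour: own id comes back
                        outcome = "tour"
                        break
                    if ids[q] < ids[p]:       # stopped by a smaller id
                        outcome = "stopped"
                        break
                messages += hops              # "Forth" transmissions
                if outcome == "limit":
                    messages += d             # "Back" transmissions
                    returned += 1
                elif outcome == "tour":
                    leader = ids[p]
            if returned == 2:                 # both messages came back
                survivors.append(p)
        if leader is not None:
            return leader, messages + n       # plus n for the notification
        candidates, stage = survivors, stage + 1


def control_bound(n, dis=lambda i: 2 ** (i - 1)):
    """Right-hand side of Expression 3.12, using the convention dis(0) = 1."""
    total, prev, i = 0.0, 1, 1
    while True:
        d = dis(i)
        total += 3 * d / prev + 1
        if d >= n:                            # i is the pseudo-inverse of n
            return n * total + n
        prev, i = d, i + 1
```

For example, `control_election(list(range(1, 33)))` runs the model on the increasing arrangement that is the worst case for AsFar; the returned message count can then be compared with `control_bound(32)`.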


To really finalize the design, we must choose the function dis. Different choices will result in different performances. Consider, for example, the choice $\mathit{dis}(i) = 2^{i-1}$; then $\frac{\mathit{dis}(i)}{\mathit{dis}(i-1)} = 2$ (i.e., we double the distance every time) and $\mathit{dis}^{-1}(n) = \lceil \log n \rceil + 1$, which in Expression 3.12 yields

$$M[\text{Control}] \leq 7\, n \log n + O(n),$$

which is what we were aiming for: a O(n log n) worst case. The constant can be, however, further improved by carefully selecting dis. It is rather difficult to determine the best function. Let us restrict the choice to among the functions where, like the one above, the ratio between consecutive values is constant, that is, $\frac{\mathit{dis}(i)}{\mathit{dis}(i-1)} = c$. For these functions, $\mathit{dis}^{-1}(n) = \lceil \log_c n \rceil + 1$; thus, Expression 3.12 becomes

$$\frac{3c+1}{\log c}\, n \log n + O(n).$$

Thus, with all of them, protocol Control has a guaranteed O(n log n) performance. The “best” among those functions will be the one where $\frac{3c+1}{\log c}$ is minimized; as distances must be integer quantities, also c must be an integer. Thus such a best choice is c = 3 for which we obtain

$$M[\text{Control}] \leq 6.309\, n \log n + O(n). \tag{3.13}$$
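The choice c = 3 can be verified by tabulating the leading constant for small integer ratios. The following Python sketch does so, assuming base-2 logarithms; the function names and the search range 2..10 are illustrative assumptions.

```python
import math


def control_constant(c: int) -> float:
    """Leading constant (3c + 1) / log2(c) of the n log n term when dis(i) = c**(i-1)."""
    return (3 * c + 1) / math.log2(c)


def best_ratio(candidates=range(2, 11)) -> int:
    """Integer ratio c that minimizes the leading constant over the given range."""
    return min(candidates, key=control_constant)
```

Evaluating `control_constant` for c = 2, 3, 4 shows the constant dropping from 7 at c = 2 to its minimum at c = 3 and rising again afterwards.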

Time

The ideal time complexity of protocol Control is easy to determine; the time required by stage i is the time needed by the message containing the smallest id to reach its assigned distance and come back to its originator; hence exactly 2 dis(i) time units. An additional n time units are needed for the final notification, as well as for the initial wake-up of the entity with the smallest id. This means that the total time costs will be at most

$$T[\text{Control}] \leq 2n + \sum_{i=1}^{\mathit{dis}^{-1}(n)} 2\, \mathit{dis}(i). \tag{3.14}$$
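As an illustration, the sum in Expression 3.14 can be evaluated for the doubling function $\mathit{dis}(i) = 2^{i-1}$. This is a worked instance for one particular function only, assuming n > 1 and base-2 logarithms:

```latex
\begin{align*}
k &= \mathit{dis}^{-1}(n) = \lceil \log n \rceil + 1
   \quad\Longrightarrow\quad 2^{k-2} < n \le 2^{k-1}, \\
\sum_{i=1}^{k} 2\, \mathit{dis}(i) &= 2 \sum_{i=1}^{k} 2^{i-1} = 2\,(2^{k} - 1) < 8n, \\
T[\text{Control}] &\le 2n + 2\,(2^{k} - 1) < 10\, n .
\end{align*}
```

The geometric growth of the distances is what keeps the sum linear in n: the last stage dominates all the previous ones together.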

Again, the choice of dis will influence the complexity. Using any function of the form $\mathit{dis}(i) = c^{i-1}$, where c > 1 is an integer, will yield O(n) time. The determination of the best choice from the time costs point of view is left as an exercise.

Electing Minimum Initiator (⋆)

Let us use the strategy of electing a leader only among the initiators. Denote as usual by k the number of initiators. Let us analyze the worst case. In the analysis of protocol Control, we have seen that those that survive stage i contribute 4 dis(i) messages each to the cost, while those that do not survive contribute at most 3 dis(i) messages each. This is still true in the modified version Control-Minit;


what changes is the values of the number $n_i$ of entities that will start that stage. Initially, $n_1 = k$. In the worst case, the k initiators are placed far enough from each other in the ring that each completes the stage without interfering with the others; if the distances between them are large enough, each can continue to go to higher stages without coming into contact with the others, thus, causing 4 dis(i) messages. For how many stages can this occur? This can occur as long as $\mathit{dis}(i) < \frac{n}{k+1}$. That is, in the worst case, $n_i = k$ in each of the first $l = \mathit{dis}^{-1}\!\left(\frac{n}{k+1}\right) - 1$ stages, and the cost will be 4 k dis(i) messages. In the following stages instead, the initiators will start interfering with each other, and the number of survivors will follow the pattern of the general algorithm: $n_i \leq \frac{n}{\mathit{dis}(i-1)+1}$. Thus, the total number M[Control-Minit] of messages in the worst case will be at most

$$M[\text{Control-Minit}] \leq 4k \sum_{i=1}^{l} \mathit{dis}(i) + n \sum_{i=l+1}^{\mathit{dis}^{-1}(n)} \left( 3\, \frac{\mathit{dis}(i)}{\mathit{dis}(i-1)} + 1 \right) + n. \tag{3.15}$$

3.3.4 Electoral Stages

In the previous protocol, we have introduced and used the idea of limiting the distances to control the complexity of the original “as far as it can” approach. This idea requires that an entity makes several successive attempts (at increasing distances) to become a leader. The idea of not making a single attempt to become a leader (as it was done in All the Way and in AsFar), but of proceeding instead in stages, is a very powerful algorithmic tool of its own. It allows us to view the election as a sequence of electoral stages: At the beginning of each stage, the “candidates” run for election; at the end of the stage, some “candidates” will be defeated, the others will start the next stage. Recall that “stage” is a logical notion, and it does not require the system to be synchronized; in fact, parts of the system may run very fast while other parts may be slow in their operation, so different entities might execute a stage at totally different times.

We will now see how the proper use of this tool allows us to achieve even better results, without controlling the distances and without return (or feedback) messages. To simplify the presentation and the discussion, we will temporarily assume that there is Message Ordering (i.e., the links are FIFO); we will remove the restriction immediately after.

As before, we will have each candidate send a message carrying its own id in both directions. Without setting an a priori fixed limit on the distance these messages can travel, we still would like to prevent them from traveling unnecessarily far (costing too many transmissions). The strategy to achieve this is simple and effective:

A message will travel until it reaches another candidate in the same (or higher) stage.


The consequence of this simple strategy is that in each stage, a candidate will receive a message from each side; thus, it will know the ids of the neighboring candidate on each side. We will use this fact to decide whether a candidate x enters the next stage: x will survive this stage only if the two received ids are not smaller than its own id(x) (recall we are electing the entity with the smallest id); otherwise, it becomes defeated. As before, we will have defeated entities continue to participate by forwarding received messages. Correctness and termination are easy to verify. Observe that the initiator with the smallest identity will never become defeated; by contrast, at each stage, its message will transform into defeated the neighboring candidate on each side (regardless of their distance). Hence, the number of candidates decreases at each stage. This means that eventually, the only candidate left is the one with the minimum id. When this happens, its messages will travel all along the ring (forwarded by the defeated entities) and reach it. Thus, a candidate receiving its own messages back knows that all other entities are defeated; it will then become leader and notify all other entities of termination. Summarizing (see also Figure 3.13): A candidate x sends a message in both directions carrying its identity; these messages will travel until they encounter another candidate node. By symmetry, entity x will receive two messages, one from the “left" and one from the “right" (independently of any sense of direction); it will then become defeated if at least one of them carries an identity smaller than its own; if both the received identities are larger than its own, it starts the next stage; ﬁnally, if the received identities are its own, it becomes leader and notiﬁes all entities of termination. A defeated node will forward any received election message, and each noninitiator will automatically become defeated upon receiving an election message. 
The protocol is shown in Figure 3.14, where close and open denote the operation of closing a port (with the effect of enqueueing incoming messages) and opening a closed port (dequeueing the messages), respectively, and where procedure Initialize is shown in Figure 3.15.
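The stage-by-stage behaviour described above can be explored with a small simulation. The Python sketch below is an idealized lockstep model rather than an implementation of the protocol: it assumes Message Ordering, distinct ids, and that every entity is an initiator, and it counts 2n “Election” transmissions per stage plus n for the final notification. The names `stages_election` and `stages_bound` are hypothetical.

```python
import math


def stages_election(ids):
    """Lockstep model of protocol Stages on a ring of distinct ids.

    ids[p] is the id at ring position p.
    Returns (leader_id, messages, stages).
    """
    n = len(ids)
    candidates = list(range(n))          # ring positions of current candidates
    messages = stages = 0
    while True:
        stages += 1
        messages += 2 * n                # one "Election" per link and direction
        k = len(candidates)
        if k == 1:                       # own id received from both sides
            return ids[candidates[0]], messages + n, stages
        # A candidate survives only if it is smaller than both neighbouring candidates.
        candidates = [
            p for j, p in enumerate(candidates)
            if ids[p] < ids[candidates[j - 1]]
            and ids[p] < ids[candidates[(j + 1) % k]]
        ]


def stages_bound(n):
    """Upper bound 2n(floor(log2 n) + 1) + n on the messages of the model above."""
    return 2 * n * (math.floor(math.log2(n)) + 1) + n
```

On the increasing arrangement `list(range(1, 9))`, only the entity with id 1 survives the first stage, so the model ends after two stages.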

FIGURE 3.13: A candidate x in an electoral stage. With y and z denoting the ids received from the neighboring candidates: x > Min{y, z} ⇒ x defeated; x < Min{y, z} ⇒ x candidate in the next stage; x = Min{y, z} ⇒ x leader.


PROTOCOL Stages.

States: S = {ASLEEP, CANDIDATE, WAITING, DEFEATED, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Election", id*, stage*)
    begin
        INITIALIZE;
        min := Min(id*, min);
        close(sender);
        become WAITING;
    end

CANDIDATE
    Receiving("Election", id*, stage*)
    begin
        if id* ≠ id(x) then
            min := Min(id*, min);
            close(sender);
            become WAITING;
        else
            send(Notify) to N(x);
            become LEADER;
        endif
    end

WAITING
    Receiving("Election", id*, stage*)
    begin
        open(other);
        stage := stage + 1;
        min := Min(id*, min);
        if min = id(x) then
            send("Election", id(x), stage) to N(x);
            become CANDIDATE;
        else
            become DEFEATED;
        endif
    end

DEFEATED
    Receiving(*)
    begin
        send(*) to other;
        if * = Notify then become FOLLOWER endif;
    end

FIGURE 3.14: Protocol Stages.

Messages It is not so obvious that this strategy is more efﬁcient than the previous one. Let us ﬁrst determine the number of messages exchanged during a stage. Consider the segment of the ring between two neighboring candidates in stage i, x, and


Procedure INITIALIZE
begin
    stage := 1;
    count := 0;
    min := id(x);
    send("Election", id(x), stage) to N(x);
end

FIGURE 3.15: Procedure Initialize used by protocol Stages.

y = r(i, x); in this stage, x will send a message to y and y will send one to x. No other messages will be transmitted during this stage in that segment. In other words, on each link, only two messages will be transmitted (one in each direction) in this stage. Therefore, in total, 2n message exchanges will be performed during each stage.

Let us determine now the number of stages. Consider a node x that is candidate at the beginning of stage i and is not defeated during this stage; let y = r(i, x) and z = l(i, x) be the first entity to the right and to the left of x, respectively, that are also candidates in stage i (Figure 3.16). It is not difficult to see that if x survives stage i, both r(i, x) and l(i, x) will be defeated. Therefore, at least half of the candidates are defeated at each stage. In other words, at most half of them survive:

$$n_i \leq \frac{n_{i-1}}{2}.$$

As $n_1 = n$, the total number of stages is at most $\sigma_{\text{Stages}} \leq \log n + 1$. Combining the two observations, we obtain

$$M[\text{Stages}] \leq 2\, n \log n + O(n). \tag{3.16}$$

That is, protocol Stages outperforms protocol Control. Observe that equality is achievable in practice (Exercise 3.10.9). Further note that if we use the Minimum Initiator approach the bound will become

$$M[\text{Stages-Minit}] \leq 2\, n \log k_* + O(n). \tag{3.17}$$

FIGURE 3.16: If x survives this stage, its neighboring candidates will not.

Removing Message Ordering The correctness and termination of Stages are easy to follow also because we have assumed in our protocol that there is Message


Ordering. This assumption ensured that the two messages received by a candidate in stage i are originated by candidates also in stage i. If we remove the Message Ordering restriction, it is possible that messages arrive out of order and that a message sent in stage j > i arrives before a message sent in stage i.

Simple Approach

The simplest way to approach this problem is by enforcing the “effects” of Message Ordering, without really having it.

1. First of all, each message will also carry the stage number of the entity originating it.

2. When a candidate node x in stage i receives a message M∗ with stage j > i, it will not process it but will locally enqueue it until it has received from that side (and processed) all the messages from stages i, i + 1, . . . , j − 1, which have been “jumped over” by M∗; it will then process M∗.

The only modification to protocol Stages as described in Figure 3.14 is the addition of the local enqueueing of messages (Exercise 3.10.6); as this is only local processing, the message and time costs are unchanged.

Stages∗

An alternative approach is to keep track of a message “jumping over” others but without enqueueing it locally. We shall describe it in some detail and call Stages* the corresponding protocol.

1. First of all, we will give a stage number to all the nodes: For a candidate entity, it is the current stage; for a defeated entity, it is the stage in which it was defeated. We will then have a defeated node forward only messages from higher stages.

2. A candidate node x in stage i receiving an Election message M∗ with stage j > i will use the id included in the message, id*, and will make a decision about the outcome of the stage i as if both of them were in the same stage.

• If x is defeated in this round, then it will forward the message M∗.

• If x survives, it means that id(x) is smaller not only than id* in M∗ but also than the ids in the messages “jumped over” by M∗ (Exercise 3.10.13). In this case, x can act as if it has already received from that side all the messages from stages i, i + 1, . . . , j, and they all have an id larger than id(x). We will indicate this fact by saying that x has now a credit of j − i messages on that port.

In other words, if a candidate x has a credit c > 0 associated with a port, it does not have to wait for a message from that port during the current stage. Clearly, the credit must be decreased in each stage.

To write the set of rules for protocol Stages* is a task that, although not difficult, requires great care and attention to details (Exercise 3.10.12); the task of proving the correctness of protocol Stages* has similar characteristics (Exercise 3.10.14). As for the resulting communication complexity, the number of messages is never more (sometimes less) than that with Message Ordering (Exercise 3.10.15).
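The local enqueueing used by the simple approach can be pictured with a small per-port buffer. The Python sketch below is only an illustration of the idea, not the solution to the cited exercises: the class name `PortBuffer` and its methods are hypothetical, and it assumes that a candidate receives at most one “Election” message per stage on each port.

```python
class PortBuffer:
    """Per-port holding area for "Election" messages that arrive ahead of their stage."""

    def __init__(self):
        self._pending = {}               # stage number -> id carried by the message

    def receive(self, stage, id_star):
        """Store an arriving message instead of processing it immediately."""
        self._pending[stage] = id_star

    def take(self, current_stage):
        """Return the id of the message for current_stage, or None if it has not arrived."""
        return self._pending.pop(current_stage, None)
```

A candidate in stage i would call `take(i)` on each port and process a message only when one is returned; messages from later stages stay in the buffer until the candidate reaches them, which reproduces the effect of FIFO links using local processing only.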


Interestingly, if we attempt to measure the ideal time complexity, we will only see executions with Message Ordering. In other words, the phenomenon of messages delivered out of order will disappear. This is yet another case showing how biased and limited (and thus dangerous) ideal time is as a cost measure. 3.3.5 Stages with Feedback We have seen how, with the proper use of electoral stages in protocol Stages, we can obtain a O(n log n) performance without the need of controlling the distance travelled by a message. In addition to controlled distances, protocol Control uses also a “feedback” technique: If a message successfully reaches its target, it returns back to its originator, providing it with a “positive feedback” on the situation it has encountered. Such a technique is missing in Stages: A message always successfully reaches its target (the next candidate in the direction it travels), which could be at an unpredictable distance; however, the use of the message ends there. Let us integrate the positive feedback idea in the overall strategy of Stages: When an “Election” message reaches its target, a positive feedback will be sent back to its originator if the id contained in the message is the smallest seen by the target in this stage. More precisely, when a candidate x receives Election messages containing id(y) and id(z) from its neighboring candidates, y = r(i, x) and z = l(i, x), it will send a (positive) “feedback” message: to y if id(y) < Min{id(x), id(z)}, to z if id(z) < Min{id(x), id(y)}, and to none otherwise. A candidate will then survive this stage and enter the new one if and only if it receives a feedback from both sides. In the example of Figure 3.17, candidates with ids 2, 5, and 8 will not send any feedback; of these three, only candidate with id 2 will enter next stage. The fate of entity with id 7 depends on its other neighboring candidate, which is not shown; so, we do not know whether it will survive or not. 
If a node sends a “feedback” message, it knows that it will not survive this stage. This is the case, for example, of the entities with ids 6, 9, 10, and 11. Some entities, however, do not send any “feedback” and wait for a “feedback” that will never arrive; this is, for example, the case of the entities with ids 5 and 8. How will such an entity discover that no “feedback” is forthcoming and it must become defeated? The answer is fortunately simple.

FIGURE 3.17: Only some candidates will send a feedback.

Every entity that survives stage i (e.g.,


the node with id 2) will start the next stage; its Stage message will act as a negative feedback for those entities receiving the message while still waiting in stage i. More speciﬁcally, if while waiting for a “feedback” message in stage i, an entity receives an “Election” message (clearly with a smaller id) in stage i + 1, it becomes defeated and forwards the message. We shall call the protocol Stages with Feedback; our description was assuming message ordering. As for protocol Stages, this restriction can and will be logically enforced with just local processing. Correctness The correctness and termination of the protocol follows from the fact that the entity xmin with the smallest identity will always receive a positive feedback from both sides; hence, it will never be defeated. At the same time, xmin never sends a positive feedback; hence, its left and right neighboring candidates in that stage do not survive it. In other words, the number ni of candidates in stage i is monotonically decreasing, and eventually only xmin will be in such a state. When this happens, its own “Election” messages will travel along the ring, and termination will be detected. Messages We are adding bookkeeping and additional messages to the already highly efﬁcient protocol Stages. Let us examine the effect of these changes. Let us start with the number of stages. As in Stages, if a candidate x in stage i survives, it is guaranteed that its neighboring candidates in the same stage, r(i, x) and l(i, x), will become defeated. With the introduction of positive feedback, we can actually guarantee that if x survives, neither will the ﬁrst candidate to the right of r(i, x) survive nor will the ﬁrst candidate to the left of l(i, x) survive. This is because if x survives, it must have received a “feedback” from both r(i, x) and l(i, x) (see Figure 3.18). 
But if r(i, x) sends a “feedback” to x, it does not send one to its neighboring candidate $r^2(i, x)$; similarly, l(i, x) does not send a “feedback” to $l^2(i, x)$. In other words,

$$n_i \leq \frac{n_{i-1}}{3}.$$

That is, at most one third of the candidates starting a stage will enter the next one. As $n_1 = n$, the total number of stages is at most $\sigma_{\text{Stages}} \leq \log_3 n + 1$. Note that there are initial configurations of the ids that will force the protocol to have exactly these many stages (Exercise 3.10.22).

FIGURE 3.18: If x survives, those other candidates do not.


In other words, the number of stages has decreased with the use of “feedback” messages. However, we are sending more messages in each stage. Let us examine now how many messages will be sent in each stage. Consider stage i; this will be started by $n_i$ candidates. Each candidate will send an “Election” message that will travel to the next candidate on either side. Thus, exactly like in Stages, two “Election” messages will be sent over each link, one in each direction, for a total of 2n “Election” messages per stage. Consider now the “feedback” messages; a candidate sends at most one “feedback” and only in one direction. Thus, in the segment of the ring between two candidates, there will be at most one “feedback” message on each link; hence, there will be no more than n “feedback” transmissions in total in each stage. This means that in each stage there will be at most 3n messages. Summarizing,

$$M[\text{StagesFeedback}] \leq 3\, n \log_3 n + O(n) \leq 1.89\, n \log n + O(n). \tag{3.18}$$
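The effect of the feedback rule on the number of stages can be observed with a lockstep simulation. The Python sketch below is an idealized model under stated assumptions: Message Ordering, distinct ids, every entity an initiator. It also adds one assumption of its own for the corner case of exactly two candidates, where both “neighboring candidates” of an entity coincide: there the larger candidate is taken to send a feedback on both sides. Because of this corner case, the helper `loose_bound` allows one stage more than the bound in the text. All names are hypothetical.

```python
import math


def stages_feedback_election(ids):
    """Lockstep model of Stages with Feedback on a ring of distinct ids.

    Returns (leader_id, messages, stages).
    """
    n = len(ids)
    cand = list(range(n))                # ring positions of current candidates
    messages = stages = 0
    while True:
        stages += 1
        messages += 2 * n                # "Election" messages of this stage
        k = len(cand)
        if k == 1:                       # own id received from both sides
            return ids[cand[0]], messages + n, stages
        received = [0] * k               # feedbacks received by each candidate
        for j, p in enumerate(cand):
            left, right = (j - 1) % k, (j + 1) % k
            me, l_id, r_id = ids[p], ids[cand[left]], ids[cand[right]]
            # Feedback goes to the side whose id is the smallest seen in this stage.
            if l_id < me and l_id <= r_id:
                received[left] += 1
                messages += (p - cand[left]) % n     # hops travelled by the feedback
            if r_id < me and r_id <= l_id:
                received[right] += 1
                messages += (cand[right] - p) % n
        cand = [p for j, p in enumerate(cand) if received[j] == 2]


def loose_bound(n):
    """3n messages per stage for at most log_3(n) + 2 stages, plus n for notification."""
    return 3 * n * (math.log(n, 3) + 2) + n
```

On the increasing arrangement `list(range(1, 10))`, every entity except the one with id 1 sends a feedback towards its smaller neighbor, so only id 1 receives two feedbacks and the model ends in the second stage.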

In other words, the use of feedback with the electoral stages allows us to reduce the number of messages in the worst case. The use of Minimum Initiator strategy yields the similar result:

$$M[\text{StagesFeedback-Minit}] \leq 1.89\, n \log k_* + O(n). \tag{3.19}$$

In the analysis of the number of “feedback” messages sent in each stage, we can be more accurate; in fact, there are some areas of the ring (composed of consecutive defeated entities between two successive candidates) where no feedback messages will be transmitted at all. In the example of Figure 3.17, this is the case of the area between the candidates with ids 8 and 10. The number of these areas is exactly equal to the number ni+1 of candidates that survive this stage (Exercise 3.10.19). However, the savings are not enough to reduce the constant in the leading term of the message costs (Exercise 3.10.21). Granularity of Analysis: Bit Complexity The advantage of protocol Stages with Feedback becomes more evident when we look at communication costs at a ﬁner level of granularity, focusing on the actual size of the messages being used. In fact, while the “Election” messages contain values, the “feedback” messages are just signals, each containing O(1) bits. (Recall the discussion in Section 3.2.) In each stage, only the 2n “Election” messages carry a value, while the other n are signals; hence, the total number of bits transmitted will be at most 2 n (c + log id) log3 n + n c log3 n + l.o.t., where id denotes the largest value sent in a message, c = O(1) denotes the number of bits required to distinguish among the different types of message, and l.o.t. stands for “lower order terms.” That is, B[StageswithFeedback] ≤ 1.26 n log n log id + l.o.t.

(3.20)
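The constants 1.89 and 1.26 in Equations 3.18 and 3.20 come from a change of base: log3 n = log n / log 3. The following minimal sketch recomputes them, assuming (as the constants suggest) that "log" without a subscript denotes the base-2 logarithm; the variable names are illustrative only.

```python
import math

# log_3(n) = log_2(n) / log_2(3), hence
# c * n * log_3(n) = (c / log_2(3)) * n * log_2(n).
message_constant = 3 / math.log2(3)  # from the 3 n log_3 n term of Equation 3.18
bit_constant = 2 / math.log2(3)      # from the 2 n log_3 n values behind Equation 3.20

print(round(message_constant, 2), round(bit_constant, 2))  # prints: 1.89 1.26
```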


The improvement on the bit complexity of Stages, where every message carries a value, is, thus, in the reduction of the constant from 2 to 1.26.

Further Improvements? The use of electoral stages allows us to transform the election process into one of successive “eliminations,” reducing the number of candidates at each stage. In the original protocol Stages, each surviving candidate will eliminate its neighboring candidate on each side, guaranteeing that at least half of the candidates are eliminated in each stage. By using feedback, protocol Stages with Feedback extends the “reach” of a candidate also to the second neighboring candidate on each side, ensuring that at least two thirds of the candidates are eliminated in each stage. Increasing the “reach” of a candidate during a stage will result in a larger proportion of the candidates being eliminated in each stage, thus reducing the number of stages. So, intuitively, we would like a candidate to reach as far as possible during a stage. Obviously the price to be paid is the additional messages required to implement the longer reach. In general, if we can construct a protocol that guarantees a reduction rate of at least b, that is, ni ≤ ni−1/b, then the total number of stages would be logb(n); if the messages transmitted in each stage are at most a n, then the overall complexity will be

$$a\, n \log_b n = \frac{a}{\log b}\, n \log n.$$

To improve on Stages with Feedback, the reduction must be done with a number of messages such that a/log b < 1.89. Whether this is possible or not is an open problem (Problem 3.10.3).

3.3.6 Alternating Steps

It should be clear by now that the road to improvement, on which creative ingenuity will travel, is oftentimes paved by a deeper understanding of what is already available. A way to achieve such an understanding is by examining the functioning of the object of our improvement in “slow motion,” so as to observe its details. Let us consider protocol Stages. It is rather simple and highly efficient.
We have already shown how to achieve improvements by extending the “reach” of a candidate during a stage; in a sense, this was really “speeding up” the functioning of the protocol. Let us examine now Stages instead by “slowing down” its functioning. In each stage, a candidate sends its id in both directions, receives an id from each direction, and decides whether to survive, be elected, or become defeated on the basis of its own value and the received ones. Consider the example shown in Figure 3.19; one stage of Stages will result in candidates w, y, and v being eliminated and x and z surviving; the fate of u will depend on its right candidate neighbor, which is not shown. We can obviously think of “sending in both directions” as two separate steps: send to one direction (say “right”) and send to the other. Assume for the moment that the ring is oriented: “right” has the same meaning for all entities. Thus, the stage can be thought of as having two steps: (1) the candidate sends to the “right” and receives from the “left”; (2) it will then send to the “left” and receive from the “right.”

FIGURE 3.19: Alternating Steps: slowing down the execution of Stages. (The figure shows the consecutive candidates w, x, y, z, v, u, with ids 8, 7, 9, 3, 10, 6, respectively, separated by defeated entities.)

Consider the first step in the same example as shown in Figure 3.19; both candidates y and v already know at this time that they would not survive. Let us take advantage of this “early” discovery. We will use each of these two steps to make an electoral decision, and we will eliminate a candidate after step (1) if it receives a smaller id in this step. Thus, a candidate will perform step (2) only if it is not eliminated in step (1). The advantage of doing so becomes clear observing that by eliminating candidates in each step of a stage, we eliminate more candidates than in the original stage; in the example of Figure 3.19, also x will be eliminated. Summarizing, the idea is that at each step, a candidate sends only one message with its value, waits for one message, and decides on the basis of its value and the received one; the key is to alternate at each step the direction in which messages are sent. This protocol, which we shall call Alternate, is shown in Figure 3.20, where close and open denote the operation of closing a port (with the effect of enqueueing incoming messages) and opening a closed port (dequeueing the messages), respectively; the procedures Initialize and Process Message are shown in Figure 3.21.

Correctness The correctness of the protocol follows immediately from observing that, as usual, the candidate xmin with the smallest value will never be eliminated and that, on the contrary, it will in each step eliminate a neighboring candidate. Hence, the number of candidates is monotonically decreasing in the steps; when only xmin is left, its message will complete the tour of the ring, transforming it into the leader. The final notification will ensure proper termination of all entities.

Costs To determine the cost is slightly more complex. There are exactly n messages transmitted in each step, so we need to determine the total number of steps σAlternate (or, where no confusion arises, simply σ) until a single candidate is left, in the worst case, regardless of the placement of the ids in the ring, time delays, and so forth. Let ni be the number of candidate entities starting step i; clearly n1 = n and nσ = 1. We know that two successive steps of Alternate will eliminate more candidates than a single stage of Stages; hence, the total number of steps will be less than twice the number of stages of Stages: σ < 2 log n. We can, however, be more accurate regarding the amount of elimination performed in two successive steps.


PROTOCOL Alternate.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER}; SINIT = {ASLEEP}; STERM = {FOLLOWER, LEADER}.

Restrictions: IR ∪ OrientedRing ∪ MessageOrdering.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Election", id*, step*)
    begin
        INITIALIZE;
        become CANDIDATE;
        PROCESS MESSAGE;
    end

CANDIDATE
    Receiving("Election", id*, step*)
    begin
        if id* ≠ id(x) then
            PROCESS MESSAGE;
        else
            send(Notify) to N(x);
            become LEADER;
        endif
    end

DEFEATED
    Receiving(∗)
    begin
        send(∗) to other;
        if ∗ = Notify then become FOLLOWER endif;
    end

FIGURE 3.20: Protocol Alternate.

Assume that in step i, the direction is “right” (thus, it will be “left” in step i + 1). Let di denote the number of candidates that are eliminated in step i. Of those ni candidates that start step i, di will be defeated and only ni+1 will survive that step. That is, ni = di + ni+1. Consider a candidate x that survives both step i and step i + 1. First of all observe that the candidate to the right of x in step i will be eliminated in that step. (If not, it would mean that its id is smaller than id(x) and thus it would eliminate x in step i + 1; but we know that x survives.) This means that every candidate that, like x, survives both steps will eliminate one candidate in the first step; in other words, di ≥ ni+2,


Procedure INITIALIZE
begin
    step := 1;
    min := id(x);
    send("Election", id(x), step) to right;
    close(right);
end

Procedure PROCESS MESSAGE
begin
    if id* < min then
        open(other);
        become DEFEATED;
    else
        step := step + 1;
        send("Election", id(x), step) to sender;
        close(sender);
        open(other);
    endif
end

FIGURE 3.21: Procedures used by protocol Alternate.

but then

$$n_i \ge n_{i+1} + n_{i+2}. \qquad (3.21)$$

The consequence of this fact is very interesting. In fact, we know that $n_\sigma = 1$ and, obviously, $n_{\sigma-1} \ge 2$. From Equation 3.21, we have $n_{\sigma-i} \ge n_{\sigma-i+1} + n_{\sigma-i+2}$. Consider now the Fibonacci numbers $F_j$ defined by $F_j = F_{j-1} + F_{j-2}$, where $F_{-1} = 0$ and $F_0 = 1$. Then, clearly, $n_{\sigma-i} \ge F_{i+1}$. It follows that $n_1 \ge F_\sigma$, but $n_1 = n$; thus σ is at most the index of the largest Fibonacci number not exceeding n. This helps us in achieving our goal of determining σ, the number of steps until there is only one candidate left. As $F_j \approx b \left(\frac{1+\sqrt{5}}{2}\right)^{j}$, where b is a positive constant, we have

$$n \ge F_\sigma \approx b \left(\frac{1+\sqrt{5}}{2}\right)^{\sigma},$$

from where, taking logarithms and noting that $1/\log\frac{1+\sqrt{5}}{2} \approx 1.44$, we get σAlternate ≤ 1.44 log n + O(1). That means that after at most so many steps, there will be only one candidate left. Observe that what we have derived is actually achievable. In fact, there are allocations of the ids to the nodes of a ring that will force the protocol to perform σAlternate steps before there is only one candidate left (Exercise 3.10.26). In the next step, this candidate will become leader and start the notification. These last two operations require n messages each. Thus the total number of messages will be

$$M[\mathrm{Alternate}] \le 1.44\, n \log n + O(n). \qquad (3.22)$$
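The step-by-step elimination analyzed above can be checked experimentally. The following is a minimal sketch, not the protocol of Figure 3.20: it abstracts away defeated entities, ports, and messages, and only tracks which candidates survive each step on an oriented ring. The function names `alternate_steps` and `fib` are hypothetical, and the six-entity ring in the example simply reuses the ids of Figure 3.19 as if they formed a closed ring.

```python
def alternate_steps(ids):
    """Simulate Alternate at the level of candidates on an oriented ring.

    `ids` lists distinct ids in ring order (index k + 1 is to the "right").
    Returns (leader_id, history), where history[k] is the number of
    candidates starting step k + 1.
    """
    candidates = list(ids)
    history = [len(candidates)]
    send_right = True  # step 1 sends to the "right"
    while len(candidates) > 1:
        m = len(candidates)
        # Sending right means receiving from the left neighbour, and vice versa.
        offset = -1 if send_right else 1
        # A candidate survives the step iff the id it receives is larger.
        candidates = [c for k, c in enumerate(candidates)
                      if c < candidates[(k + offset) % m]]
        history.append(len(candidates))
        send_right = not send_right
    return candidates[0], history


def fib(j):
    """Fibonacci numbers with F_{-1} = 0 and F_0 = 1, as in the text."""
    a, b = 0, 1
    for _ in range(j):
        a, b = b, a + b
    return b


leader, history = alternate_steps([8, 7, 9, 3, 10, 6])
print(leader, history)  # prints: 3 [6, 3, 2, 1]
```

In this run the history satisfies Equation 3.21 at every step (6 ≥ 3 + 2 and 3 ≥ 2 + 1), and the number of entries, 4, is the index of the largest Fibonacci number not exceeding 6 (F4 = 5).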

In other words, protocol Alternate is not only simple but also more efficient than all other protocols seen so far. Recall, however, that it has been described and analyzed assuming that the ring is oriented.

Question. What happens if the ring is not oriented?

If the entities have different meanings for “right,” then, when implementing the first step, some candidates will send messages clockwise while others will send in a counterclockwise direction. Notice that in the implementation for oriented rings described above, this would lead to deadlock because we close the port we are not waiting to receive from; the implementation can be modified so that the ports are never closed (Exercise 3.10.24). Consider this to be the case. It will then happen that a candidate waiting to receive from “left” will instead receive from “right.” Call this situation a conflict. What we need to do is to add to the protocol a conflict resolution mechanism to cope with such situations. Clearly this complicates the protocol (Problem 3.10.2).

3.3.7 Unidirectional Protocols

The first two protocols we have examined, All the Way and AsFar, did not really require the restriction Bidirectional Links; in fact, they can be used without any modification in a directed or a unidirectional ring. The subsequent protocols Distances, Stages, Stages with Feedback, and Alternate all used the communication links in both directions, for example, for obtaining feedback. It was through them that we have been able to reduce the costs from O(n²) to a guaranteed O(n log n) messages. The immediate and natural question is as follows:

Question. Is “Bidirectional Links” necessary for an O(n log n) cost?

The question is practically relevant because if the answer is positive, it would indicate that an additional investment in communication hardware (i.e., full duplex lines) is necessary to reduce the operating costs of the election task.
The answer is important also from a theoretical point of view because if positive, it would clearly indicate the “power” of the restriction Bidirectional Links. Not surprisingly, this question has attracted the attention of many researchers. We are going to see now that the answer is actually No.


We are also going to see that, strangely enough, we know how to do better with unidirectional links than with bidirectional ones. First of all, we are going to show how the execution of protocols Stages and Alternate can be simulated in unidirectional links yielding the same (if not better) complexity. Then, using the lessons learned in this process, we are going to develop a more efficient unidirectional solution.

Unidirectional Stages What we are going to do is to show how to simulate the execution of protocol Stages in unidirectional rings $\vec{R}$, with the same message costs. Consider how protocol Stages works. In a stage, a candidate entity x
1. sends a message carrying a value (its id) in both directions and thus receives a message with the value (the id) of another candidate from each direction, and then,
2. on the basis of these three values (i.e., its own and the two received ones), makes a decision on whether it (and its value) should survive this stage and start the next stage.

Let us implement each of these two steps separately. Step (1) is clearly the difficult one because, in a unidirectional ring, messages can only be sent in one direction. Decompose the operation “send in both directions” into two substeps: (I) “send in one direction” and then (II) “send in the other direction.” Substep (I) can be executed directly in $\vec{R}$; as a result, every candidate will receive a message with the value of its neighboring candidate from the opposite direction (see Figure 3.22(c)). The problem is now in implementing substep (II); as we cannot send information in the other direction, we will send information again in the same direction, and, as it is meaningless to send again the same information, we will send the information we just received. As a result, every candidate will receive now the value of another candidate from the opposite direction (see Figure 3.22(d)). Every entity in $\vec{R}$ has now three values at its disposal: the one it started with plus the two received ones.

We can now proceed to implement Step (2). To simulate the bidirectional execution, we need that a candidate decides on whether to survive or to become passive on the basis of exactly the same information in $\vec{R}$ as in the bidirectional case. Consider the initial configuration in the example shown in Figure 3.22 and focus on the candidate x with starting value 7; in the bidirectional case, x decides that the value 7 should survive on the basis of the information: 7, 15, and 8. In the unidirectional case, after the implementation of Step (1), x knows now 4 and 15 in addition to 7. This is not the same information at all. In fact, it would lead to totally different decisions in the two cases, destroying the simulation. There is, however, in $\vec{R}$ a candidate that, at the end of Step (1), has exactly the same information that x has at the end of Step (1) in the bidirectional case: this is the candidate that started with value 8. In fact, the information available in R exists in $\vec{R}$ (compare carefully Figures 3.22(b) and (d)), but it is shifted to the “next” candidate in the ring direction. It is, thus, possible to make the same decisions in $\vec{R}$ as in R; they will just have to be made by different entities in the two cases.


FIGURE 3.22: (a) Initial configuration; (b) information after the first full stage of Stages with Bidirectional Links; (c) information after the first substep in the unidirectional simulation; (d) information after the second substep.

In each stage, a candidate makes a decision on a value. In protocol Stages, this value was always the candidate’s id. In the unidirectional algorithm, this value is not the id; it is the first value sent by its neighboring candidate in Step (1). We will call this value the envelope. IMPORTANT. Be aware that unless we add the assumption Message Ordering, it is possible that the second value arrives before the envelope. This problem can be solved (e.g., by locally enqueueing out-of-order messages). It is not difficult to verify that the simulation is exact: In each stage, exactly the same values survive in $\vec{R}$ as in R; thus, the number of stages is exactly the same.
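The exactness of the simulation can be illustrated with a small experiment. The sketch below compares one stage of the bidirectional rule with one stage of the unidirectional rule at the level of values only. It assumes the survival rule of Stages is "a value survives exactly when it is smaller than the values of both neighbouring candidates"; the function names and the eight-value ring are hypothetical and are not taken from Figure 3.22.

```python
def stages_survivors(values):
    """One stage of bidirectional Stages on the current candidates
    (listed in ring order): the set of values that survive."""
    m = len(values)
    return {v for k, v in enumerate(values)
            if v < values[(k - 1) % m] and v < values[(k + 1) % m]}


def unistages_survivors(values):
    """One stage of the unidirectional simulation.

    Candidate k holds value1 = values[k]; it first receives the envelope
    values[k - 1] and then values[k - 2], and decides on the envelope.
    Returns {candidate index: value it plays for in the next stage}.
    """
    m = len(values)
    survivors = {}
    for k in range(m):
        value1 = values[k]
        envelope = values[(k - 1) % m]
        value2 = values[(k - 2) % m]
        if envelope < min(value1, value2):
            survivors[k] = envelope  # candidate k continues with the envelope
    return survivors


ring = [7, 15, 4, 12, 9, 11, 5, 8]
print(sorted(stages_survivors(ring)))  # prints: [4, 5, 7, 9]
print(unistages_survivors(ring))       # prints: {1: 7, 3: 4, 5: 9, 7: 5}
```

The same four values survive in both cases; in the unidirectional case each of them is carried by the next candidate in the ring direction.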


PROTOCOL UniStages.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER}; SINIT = {ASLEEP}; STERM = {FOLLOWER, LEADER}.

Restrictions: IR ∪ UnidirectionalRing.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Election", value*, stage*, order*)
    begin
        send("Election", value*, stage*, order*);
        become DEFEATED;
    end

CANDIDATE
    Receiving("Election", value*, stage*, order*)
    begin
        if value* ≠ value1 then
            PROCESS MESSAGE;
        else
            send(Notify);
            become LEADER;
        endif
    end

DEFEATED
    Receiving(∗)
    begin
        send(∗);
        if ∗ = Notify then become FOLLOWER endif;
    end

FIGURE 3.23: Protocol UniStages.

The cost of each stage is also the same: 2n messages. In fact, each node will send (or forward) exactly two messages. In other words, M[UniStages] ≤ 2 n log n + O(n).

(3.23)

This shows that O(n log n) guaranteed message costs can be achieved in ring networks also without Bidirectional Links. The corresponding protocol UniStages is shown in Figure 3.23, described not as a unidirectional simulation of Stages (which indeed it is) but directly as a unidirectional protocol.

NOTES. In this implementation,
1. we elect a leader only among the initiators (using approach Minimum Initiator);
2. Message Ordering is not assumed; within a stage, we use a Boolean variable, order, to distinguish between value and envelope and to cope with messages from different stages arriving out of order: if a candidate receives a message from the “future” (i.e., with a higher stage number), it will be transformed immediately into defeated and will forward the message.

Unidirectional Alternate We have shown how to simulate Stages in a unidirectional ring, achieving exactly the same cost. Let us focus now on Alternate; this protocol makes full explicit use of the full duplex communication capabilities of the bidirectional ring by alternating direction at each step. Surprisingly, it is possible to achieve an exact simulation also of this protocol in a unidirectional ring $\vec{R}$. Consider how protocol Alternate works. In a “left” step,
1. a candidate entity x sends a message carrying a value v(x) to the “left”, and receives a message with the value of another candidate from the “right”;

Procedure INITIALIZE
begin
    stage := 1; count := 0; order := 0;
    value1 := id(x);
    send("Election", value1, stage, order);
end

Procedure PROCESS MESSAGE
begin
    if stage* = stage then
        if order* = 0 then
            envelope := value*;
            order := 1;
            send("Election", value*, stage*, order);
        else
            value2 := value*;
        endif
        count := count + 1;
        if count = 2 then
            if envelope < Min(value1, value2) then
                order := 0; count := 0;
                stage := stage + 1;
                value1 := envelope;
                send("Election", value1, stage, order);
            else
                become DEFEATED;
            endif
        endif
    else
        if stage* > stage then
            send("Election", value*, stage*, order*);
            become DEFEATED;
        endif
    endif
end

FIGURE 3.24: Procedures used by protocol UniStages.

FIGURE 3.25: (a-b) Information after (a) the first step and (b) the second step of Alternate in an oriented bidirectional ring. (c-d) Information after (c) the first step and (d) the second step of the unidirectional simulation.

2. on the basis of these two values (i.e., its own and the received one), x makes a decision on whether it (and its value) should survive this step and start the next step.

The actions in a “right” step are the same except that “left” and “right” are interchanged. Consider the ring $\vec{R}$ shown in Figure 3.25, and assume we can send messages only to the “right”. This means that the initial “right” step can be trivially implemented: Every entity will send a value (its own) and receive another; it starts the next step if and only if the value it receives is not smaller than its own.


Let us concentrate on the “left” step. As a candidate cannot send a value to the left, it will have to send the value to the “right”. Let us do so. Every candidate in $\vec{R}$ has now two values at its disposal: the one it started with and the received one. To simulate the bidirectional execution, we need that a candidate makes a decision on whether to survive or to become passive on the basis of exactly the same information in $\vec{R}$ as in the bidirectional case. Consider the initial configuration in the example shown in Figure 3.25. First of all observe that the information in the “right” step is the same both in the bidirectional (a) and in the unidirectional (c) case. The differences occur in the “left” step. Focus on the candidate x with starting value 7; in the second step of the bidirectional case, x decides that the value 7 should not survive on the basis of the information: 5 and 7. In the unidirectional case, after the second step, x knows now 7 and 8. This is not the same information at all. In fact, it would lead to totally different decisions in the two cases, destroying the simulation. There is, however, in $\vec{R}$ a candidate that, at the end of the second step, has exactly the same information that x has in the bidirectional case: this is the candidate that started with value 5. As we have seen already in the simulation of Stages, the information available in R exists in $\vec{R}$ (compare carefully Figures 3.25(b) and (d)). It is, thus, possible to make the same decisions in $\vec{R}$ as in R; they will just have to be made by different entities in the two cases.

Summarizing, in each step, a candidate makes a decision on a value. In protocol Alternate, this value was always the candidate’s id. In the unidirectional algorithm, this value changes depending on the step. Initially, it is its own value; in the “left” step, it is the value it receives; in the “right” step, it is the value it already has. In other words,
1. in the “right” step, a candidate x survives if and only if the received value is larger than v(x);
2. in the “left” step, a candidate x survives if and only if the received value is smaller than v(x), and if so, x will now play for that value.

Working out a complete example will help clarify the simulation process and dispel any confusion (Exercise 3.10.33).

IMPORTANT. Be aware that unless we add the assumption Message Ordering, it is possible that the value from step i + 1 arrives before the value for step i.

It is not difficult to verify that the simulation is exact: In each step, exactly the same values survive in $\vec{R}$ as in R; thus, the number of steps is exactly the same. The cost of each step is also the same: n messages. Thus, M[UniAlternate] ≤ 1.44 n log n + O(n).

(3.24)
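The two survival rules summarized above can be compared against the bidirectional protocol with a short experiment. The sketch below works on values only (no messages, no defeated entities) and assumes distinct ids listed in ring order, with index k + 1 to the "right"; the names `alternate_step`, `unialternate_step`, and `run` are illustrative, and the example ring reuses the ids of Figure 3.19 as a closed ring.

```python
def alternate_step(values, direction):
    """One step of bidirectional Alternate: in a "right" step a candidate
    hears from its left neighbour, in a "left" step from its right one;
    it survives iff its own value is smaller than the received one."""
    m = len(values)
    offset = -1 if direction == "right" else 1
    return [v for k, v in enumerate(values) if v < values[(k + offset) % m]]


def unialternate_step(values, direction):
    """The same step when messages only travel to the "right": every
    candidate receives the value currently held by its left neighbour."""
    m = len(values)
    out = []
    for k, own in enumerate(values):
        received = values[(k - 1) % m]
        if direction == "right":
            if received > own:
                out.append(own)       # keeps playing for its own value
        elif received < own:
            out.append(received)      # now plays for the received value
    return out


def run(step, ids):
    """Apply `step` with alternating directions until one value is left;
    return the sorted surviving values after each step."""
    values, direction, trace = list(ids), "right", []
    while len(values) > 1:
        values = step(values, direction)
        trace.append(sorted(values))
        direction = "left" if direction == "right" else "right"
    return trace


ids = [8, 7, 9, 3, 10, 6]
print(run(alternate_step, ids))     # prints: [[3, 6, 7], [3, 6], [3]]
print(run(unialternate_step, ids))  # prints: [[3, 6, 7], [3, 6], [3]]
```

Both runs keep the same values alive after every step, so the number of steps coincides; only the entity holding each value differs.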

The unidirectional simulation of Alternate is shown in Figure 3.26; it has been simplified so that we elect a leader only among the initiators, and assuming Message Ordering. The protocol can be modified to remove this assumption without changes in its cost (Exercise 3.10.34). The procedures Initialize and Process Message are shown in Figure 3.27.

PROTOCOL UniAlternate.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER}; SINIT = {ASLEEP}; STERM = {FOLLOWER, LEADER}.

Restrictions: IR ∪ UnidirectionalRing ∪ MessageOrdering.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Election", value*, step*, direction*)
    begin
        send("Election", value*, step*, direction*);
        become DEFEATED;
    end

CANDIDATE
    Receiving("Election", value*, step*, direction*)
    begin
        if value* ≠ value then
            PROCESS MESSAGE;
        else
            send(Notify);
            become LEADER;
        endif
    end

DEFEATED
    Receiving(∗)
    begin
        send(∗);
        if ∗ = Notify then become FOLLOWER endif;
    end

FIGURE 3.26: Protocol UniAlternate.

An Alternative Approach In all the solutions we have seen so far, both for unidirectional and bidirectional rings, we have used the same basic strategy of minimum finding; in fact, in all of the protocols so far, we have elected as a leader the entity with the smallest value (either among all the entities or among just the initiators). Obviously, we could have used maximum finding in those solution protocols, just substituting the function Min with Max and obtaining the exact same performance. A very different approach consists in mixing these two strategies. More precisely, consider the protocols based on electoral stages. In all of them, what we could do is to alternate strategy in each stage: In “odd” stages we use the function Min, and in “even” stages we use the function Max. Call this approach min-max.


Procedure INITIALIZE
begin
    step := 1;
    direction := "right";
    value := id(x);
    send("Election", value, step, direction);
end

Procedure PROCESS MESSAGE
begin
    if direction = "right" then
        if value < value* then
            step := step + 1;
            direction := "left";
            send("Election", value, step, direction);
        else
            become DEFEATED;
        endif
    else
        if value > value* then
            step := step + 1;
            direction := "right";
            value := value*;
            send("Election", value, step, direction);
        else
            become DEFEATED;
        endif
    endif
end

FIGURE 3.27: Procedures used by protocol UniAlternate.

It is not difficult to verify that all the stage-based protocols we have seen so far, both bidirectional and unidirectional, still correctly solve the election problem; moreover, they do so with the same costs as before (Exercises 3.10.11, 3.10.23, 3.10.28, 3.10.31, 3.10.36). The interesting and surprising thing is that this approach can lead to the design of a more efficient protocol for unidirectional rings. The protocol we will construct has a simple structure. Let us assume that every entity starts and that there is Message Ordering (we will remove both assumptions later).
1. Each initiator x becomes candidate, prepares a message containing its own value id(x) and the stage number i = 1, and sends it (recall, we are in a unidirectional ring, so there is only one out-neighbor); x is called the originator of this message and remembers its content.
2. When a message with value b arrives at a candidate y, y compares the received value b with the value a it sent in its last message.
(a) If a = b, the message originated by y has made a full trip around the ring; y becomes the leader and notifies all other entities of termination.
(b) If a ≠ b, the action y will take depends on the stage number j:
(i) if j is “even,” the message is discarded if and only if a > b (i.e., b survives only if max);

FIGURE 3.28: Protocol MinMax: (a) In an even stage, a candidate survives only if it receives an envelope with a larger value; (b) it then generates an envelope with that value and starts the next stage; (c) in an odd stage, a candidate survives only if it receives an envelope with a smaller value; if so, it generates an envelope with that value and starts the next stage.

(ii) if j is “odd,” the message is discarded if and only if a < b (i.e., b survives only if min).
If the message is discarded, y becomes defeated; otherwise, y will enter the next stage: it originates a message with content (b, j + 1) and sends it.
3. A defeated entity will, as usual, forward received messages.

For example, see Figure 3.28. The correctness of the protocol follows from observing that, (a) in an even stage i, the candidate x receiving the largest of all values in that stage, vmax(i), will survive and enter the next stage; by contrast, its “predecessor” l(i, x) that originated that message will become defeated (Exercise 3.10.37), and (b) in an odd stage j, the candidate y receiving the smallest of all values in that stage, vmin(j), will survive and enter the next stage; furthermore, its “predecessor” l(j, y) that originated that message will become defeated. In other words, in each stage at least one candidate will survive that stage, and the number of candidates in a stage is monotonically decreasing with the number of stages. Thus, within finite time, there will be only one candidate left; when that happens, its message returns to it, transforming it into a leader.


IMPORTANT. Note that the entity that will be elected leader need not be the one with the smallest value nor the one with the largest value.

Let us now consider the costs of this protocol, which we will call MinMax. In a stage, each candidate sends a message that travels to the next candidate. In other words, in each stage there will be exactly n messages. Thus, to determine the total number of messages, we need to compute the number σMinMax of stages. We can rephrase the protocol in terms of values instead of entities. Each value sent in a stage j travels from its originator to the next candidate in stage j. Of all these values, only some will survive and will be sent in the next stage: In an even stage, a value survives if it is larger than its “successor” (i.e., the next value in the ring also in this stage); similarly, in an odd stage, it survives if it is smaller than its successor. Let ni be the number of values in stage i; of those, di will be discarded and ni+1 will be sent in the next stage. That is, ni+1 = ni − di. Let i be an odd (i.e., min) stage, and let value v survive this stage; this means that the successor of v in stage i, say u, is larger than v, that is, u > v. Let v survive also stage i + 1 (an even, i.e., max, stage). This implies that u must have been discarded in stage i: If not, the entity that originates the message (u, i + 1) would discard (v, i + 1) because u > v, but we know that v survives this stage. This means that every value that, like v, survives both stages will eliminate one value in the first of the two stages; in other words, ni+2 ≤ di, but then ni ≥ ni+1 + ni+2.

(3.25)

Notice that this is exactly the same equation as the one (Equation 3.21) we derived for protocol Alternate. We thus obtain that σMinMax ≤ 1.44 log n + O(1). After at most this many stages, there will be only one value left. Observe that this bound is actually achievable. In fact, there are allocations of the ids to the nodes of a ring that will force the protocol to perform σMinMax stages before there is only one value left (Exercise 3.10.38). The candidate sending this value will receive its message back and become leader; it will then start the notification. These last two steps require n messages each; thus the total number of messages will be M[MinMax] ≤ 1.44 n log n + O(n).

(3.26)
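The value-based view of MinMax used in this analysis can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: every entity is an initiator, values are distinct and listed in ring order, and the value at index k travels to the candidate at index k + 1, whose own last-sent value plays the role of "successor". The function name `minmax_stages` and the six-value ring are hypothetical.

```python
def minmax_stages(ids):
    """Value-level simulation of MinMax on a unidirectional ring.

    Returns (surviving_value, history), where history[k] is the number
    of values sent in stage k + 1.
    """
    values = list(ids)
    history = [len(values)]
    stage = 1
    while len(values) > 1:
        m = len(values)
        if stage % 2 == 1:
            # Odd (min) stage: a value survives iff it is smaller than its successor.
            values = [v for k, v in enumerate(values) if v < values[(k + 1) % m]]
        else:
            # Even (max) stage: a value survives iff it is larger than its successor.
            values = [v for k, v in enumerate(values) if v > values[(k + 1) % m]]
        history.append(len(values))
        stage += 1
    return values[0], history


value, history = minmax_stages([11, 10, 20, 22, 13, 9])
print(value, history)  # prints: 20 [6, 3, 1]
```

In this run the surviving value, 20, is neither the smallest (9) nor the largest (22) value in the ring, and the history satisfies Equation 3.25 (6 ≥ 3 + 1).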


PROTOCOL MinMax

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER}; SINIT = {ASLEEP}; STERM = {FOLLOWER, LEADER}.

Restrictions: IR ∪ UnidirectionalRing ∪ MessageOrdering.

ASLEEP
    Spontaneously
    begin
        stage := 1;
        value := id(x);
        send("Envelope", value, stage);
        become CANDIDATE;
    end

    Receiving("Envelope", value*, stage*)
    begin
        send("Envelope", value*, stage*);
        become DEFEATED;
    end

CANDIDATE
    Receiving("Envelope", value*, stage*)
    begin
        if value* ≠ value then
            PROCESS ENVELOPE;
        else
            send(Notify);
            become LEADER;
        endif
    end

DEFEATED
    Receiving("Envelope", value*, stage*)
    begin
        send("Envelope", value*, stage*);
    end

    Receiving("Notify")
    begin
        send("Notify");
        become FOLLOWER;
    end

FIGURE 3.29: Protocol MinMax.

In other words, we have been able to obtain the same costs as UniAlternate with a very different protocol, MinMax, described in Figure 3.29. We have assumed that all entities start. When removing this assumption we have two options: The entities that are not initiators can be (i) made to start (as if they were initiators) upon receiving their first message or (ii) made passive so that they just act as relayers. The second option is the one used in Figure 3.29. We have also assumed Message Ordering in our discussion. As with all the other protocols we have considered, this restriction can be enforced with just local bookkeeping at each entity, without any increase in complexity (Exercise 3.10.39).


ELECTION

Procedure PROCESS ENVELOPE
begin
    if odd(stage*) then
        if value* < value then
            stage := stage + 1;
            value := value*;
            send("Envelope", value, stage);
        else
            become DEFEATED;
        endif
    else
        if value* > value then
            stage := stage + 1;
            value := value*;
            send("Envelope", value, stage);
        else
            become DEFEATED;
        endif
    endif
end

FIGURE 3.30: Procedure Process Envelope of Protocol MinMax.
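The behavior of MinMax can be traced with a small stage-by-stage simulation. This is only a sketch under simplifying assumptions that go beyond the figures: all entities start simultaneously, stages are executed in lockstep, and position p sends to position (p + 1) mod n. The function name `minmax_election` and its return tuple are hypothetical.

```python
def minmax_election(ids):
    """Simulate MinMax on a unidirectional ring, one stage at a time.

    ids[p] is the distinct id of the entity at position p; messages travel
    from position p to position (p + 1) % n.
    Returns (leader_position, surviving_value, stages, messages).
    """
    n = len(ids)
    if n == 0 or len(set(ids)) != n:
        raise ValueError("ids must be non-empty and distinct")
    candidates = list(enumerate(ids))  # (position, current value), in ring order
    stage, messages = 1, 0
    while True:
        # Every envelope travels to the next candidate; the gaps between
        # consecutive candidates add up to n transmissions per stage.
        messages += n
        if len(candidates) == 1:
            # The envelope came back to its sender: it becomes leader and
            # the notification costs another n messages.
            position, value = candidates[0]
            return position, value, stage, messages + n
        survivors = []
        for k, (position, value) in enumerate(candidates):
            received = candidates[k - 1][1]  # envelope from the preceding candidate
            if stage % 2 == 1:
                keep = received < value      # odd stage: smaller envelopes survive
            else:
                keep = received > value      # even stage: larger envelopes survive
            if keep:
                survivors.append((position, received))
        candidates = survivors
        stage += 1

if __name__ == "__main__":
    print(minmax_election([5, 2, 8, 1]))
```

For the ring with ids 5, 2, 8, 1 the sketch reports a leader at position 0 holding value 2 after three stages. As in the protocol, the elected entity is the one whose envelope returns, not necessarily the one with the minimum id.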

Hacking: Employing the Defeated

The different approach used in protocol MinMax has led to a different way of obtaining the same efficiency as we already had with UniAlternate. The advantage of MinMax is that it is possible to obtain additional improvements that lead to a significantly better performance.

Observe that, as in most previous protocols, the defeated entities play a purely passive role, that is, they just forward messages. The key observation we will use to obtain an improvement in performance is that these entities can be exploited in the computation.

Let us concentrate on the even stages and see if we can obtain some savings for those steps. The message sent by a candidate travels (forwarded by the defeated entities) until it encounters the next candidate. This distance can vary and can be very large. What we will do is to control the maximum distance to which the message will travel, following the idea we developed in Section 3.3.3.

(I) In an even step j, a message will travel no more than a predefined distance dis(j).

This is implemented by having in the message a counter (initially set to dis(j)) that is decreased by one by each defeated node it passes. The appropriate choice of dis(j) will be discussed below.

Every change we make in the protocol has strong consequences. As a consequence of (I), the message from x might not reach the next candidate y if y is too far away (more than dis(j)) (see Figure 3.31). In this case, the candidate y does not receive the message in this stage and, thus, does not know what to do for the next stage.

IMPORTANT. It is possible that every candidate is too far away from the next one in this stage, and hence none of them will receive a message.

FIGURE 3.31: Protocol MinMax+. Controlling the distance: In even stage j, the message does not travel more than dis(j) nodes. (a) If it does not reach the next candidate y, the defeated node reached last, z, will become candidate and start the next step; (b) in the next step, the message from z transforms into defeated the entity y still waiting for the stage j message.

However, if candidate y does not receive the message from x, it is because the counter of the message containing (v, j) reaches 0 at a defeated node z on the way from x to y (see Figure 3.31). To ensure progress (i.e., absence of deadlock), we make the defeated node z become candidate and start the next stage j + 1 immediately, sending (v, j + 1). That is,

(II) In an even step j, if the counter of the message reaches 0 at a defeated node z, then z becomes candidate and starts stage j + 1 with value = v*, where v* is the value in the transfer message.

In other words, we are bringing some defeated nodes back into the game, making them candidates again. This operation could be dangerous for the complexity of the protocol as the number of candidates appears to be increasing (and not decreasing). This is easily taken care of: The candidates, like y, waiting for a transfer message that will not arrive will become defeated.

Question. How will y know that it is defeated? The answer is simple. The candidate that starts the next stage (e.g., z in our example) sends a message; when this message reaches a candidate (e.g., y) still waiting for a message from the previous stage, that entity will understand, become defeated, and forward the message. In other words,

(III) When, in an even step, a candidate receives a message for the next step, it becomes defeated and forwards the message.

We are giving decisional power to the defeated nodes, even bringing some of them back to "life." Let us push this concept forward and see if we can obtain some other savings. Let us concentrate on the odd stages.
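Rules (I) and (II) can be illustrated for a single envelope with a very small sketch. It assumes one reading of the counter: each defeated node decrements the counter on receipt, and a node that brings it to 0 is promoted. The function name `even_stage_hop`, its parameters, and the outcome labels are hypothetical.

```python
def even_stage_hop(gap, dis_j):
    """Fate of one envelope in an even stage under distance control.

    gap   -- number of hops from the sending candidate x to the next candidate y
    dis_j -- maximum distance dis(j) allowed in this stage
    Returns (outcome, transmissions): "delivered" if the envelope reaches y,
    "promoted" if the counter hits 0 at a defeated node z, which then becomes
    candidate for stage j + 1.
    """
    if gap < 1 or dis_j < 1:
        raise ValueError("gap and dis_j must be positive")
    if gap <= dis_j:
        # Only gap - 1 defeated nodes are crossed, so the counter stays positive.
        return ("delivered", gap)
    # The defeated node at distance dis_j sets the counter to 0 and is promoted.
    return ("promoted", dis_j)

if __name__ == "__main__":
    print(even_stage_hop(3, 5))
    print(even_stage_hop(8, 5))
```

In either case the envelope costs min(gap, dis(j)) transmissions, which is the quantity bounded later in the message analysis.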


Consider an even stage i in MinMax (e.g., Figure 3.28). Every candidate x sends its message containing the value and the stage number and receives a message; it becomes defeated if the received value is smaller than the one it sent. If it survives, x starts stage i + 1: It sends a message with the received value and the new stage number (see Figure 3.28(b)); this message will reach the next candidate.

Concentrate on the message (11, 3) in Figure 3.28(b) sent by x. Once (11, 3) reaches its destination y, as 11 < 22 and we are in an odd (i.e., min) stage, a new message (11, 4) will be originated. Observe that the fact that (11, 4) must be originated can be discovered before the message reaches y (see Figure 3.32(c)). In fact, on its travel from x to y, message (11, 3) will reach the defeated node z that originated (20, 2) in the previous stage; once this happens, z knows that 11 will survive this stage (Exercise 3.10.40). What z will do is to become candidate again and immediately send (11, 4).

(IV) When, in an even stage, a candidate becomes defeated, it will remember the stage number and the value it sent. If, in the next stage, it receives a message with a smaller value, it will become candidate again and start the next stage with that value.

In our example, this means that the message (11, 3) from x will stop at z and never reach y; thus, we will save d(z, y) messages. Notice that in this stage every message with a smaller value will be stopped earlier. We have, however, transformed a defeated entity into a candidate. This operation could be dangerous for the complexity of the

FIGURE 3.32: Protocol MinMax+. (a) Early promotion in odd stages. (b) The message (11, 3) from x, on its way to y, reaches the defeated node z that originated (20, 2). (c) Node z becomes candidate and immediately originates envelope (11, 4).

protocol as the number of candidates appears to be increasing (and not decreasing). This is easily taken care of: The candidates, like y, waiting for a message of an odd stage that will not arrive will become defeated. How will y know that it is defeated? The answer again is simple. The candidate that starts the next stage (e.g., z in our example) sends the message; when this message reaches an entity still waiting for a message from the previous stage (e.g., y), that entity will understand, become defeated, and forward the message. In other words,

(V) When, in an odd step, a candidate receives a message for the next step, it becomes defeated and forwards the message.

The modifications to MinMax described by (I)–(V) generate a new protocol that we shall call MinMax+ (Exercises 3.10.41 and 3.10.42).

Messages

Let us estimate the cost of protocol MinMax+. First of all observe that in protocol MinMax, in each stage a message (v, i) would always reach the next candidate in that stage. This is not necessarily so in MinMax+. In fact, in an even stage i no message will travel more than dis(i), and in an odd stage a message can be "promoted" by a defeated node on the way. We must concentrate on the savings in each type of stage.

Consider a message (v, i); denote by $h_i(v)$ the candidate that originates it, and if the message is discarded in this stage, denote by $g_i(v)$ the node that discards it. For the even stages, we must first of all choose the maximum distance dis(i) a message will travel. We will use

$$dis(i) = F_{i+2}.$$

With this choice of distance, we have a very interesting property.

Property 3.3.1 Let i be even. If message (v, i) is discarded in this stage, then $d(h_i(v), g_i(v)) \ge F_i$. For any message (v, i + 1), $d(h_i(v), h_{i+1}(v)) \ge F_{i+1}$.

This property allows us to determine the number of stages $\sigma_{MinMax+}$: In an even stage i, the distance traveled by any message is at least $F_i$; however, none of these messages travels beyond the next candidate in the ring. Hence, the distance between two successive candidates in an odd stage i is at least $F_i$; this means that the number $n_i$ of candidates is at most $n_i \le n / F_i$. Hence, the number of stages will be at most $F_n^{-1} + O(1)$, where $F_n^{-1}$ is the smallest integer j such that $F_j \ge n$. Thus the algorithm will use at most

$$\sigma_{MinMax+} \le 1.44 \log n + O(1)$$

stages. This is the same as protocol MinMax.
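The constant 1.44 in the stage bound comes from the growth rate of the Fibonacci numbers. The short derivation below is a sketch added for convenience; it assumes the numbering $F_1 = F_2 = 1$ and base-2 logarithms, neither of which is restated in this section.

```latex
% Assumption: F_1 = F_2 = 1, F_{j} = F_{j-1} + F_{j-2}; \varphi = (1+\sqrt{5})/2, so \varphi^2 = \varphi + 1.
% Claim: F_j \ge \varphi^{\,j-2} for all j \ge 1.
%   Base cases: F_1 = 1 \ge \varphi^{-1}, \quad F_2 = 1 = \varphi^{0}.
%   Induction:  F_j = F_{j-1} + F_{j-2} \ge \varphi^{\,j-3} + \varphi^{\,j-4}
%                   = \varphi^{\,j-4}(\varphi + 1) = \varphi^{\,j-2}.
% Let j be the smallest index with F_j \ge n. Then F_{j-1} < n, hence
\[
  \varphi^{\,j-3} \le F_{j-1} < n
  \quad\Longrightarrow\quad
  j < \log_{\varphi} n + 3
    = \frac{\log_2 n}{\log_2 \varphi} + 3
    \approx 1.44 \log_2 n + 3 .
\]
```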


The property also allows us to measure the number of messages we save in the odd stages. In our example of Figure 3.32(b), message (11, 3) from x will stop at z and never reach y; thus, we will save d(z, y) transmissions. In general, a message with value v that reaches an even stage i + 1 (e.g., (11, 4)) saves at least $F_i$ transmissions in stage i (Exercise 3.10.44). The total number of transmissions in an odd stage i is, thus, at most $n - n_{i+1} F_i$, where $n_{i+1}$ denotes the number of candidates in stage i + 1.

The total number of messages in an even stage is at most n. As in an even stage i + 1 each message travels at most $F_{i+3}$ (by Property 3.3.1), the total number of message transmissions in an even stage i + 1 will be at most $n_{i+1} F_{i+3}$. Thus, the total number of messages in an even stage i + 1 is at most $\min\{n,\; n_{i+1} F_{i+3}\}$.

If we now consider an odd stage i followed by an even stage i + 1, the total number of message transmissions in the two stages will be at most

$$\min\{\, n + n_{i+1}(F_{i+3} - F_i),\; 2n - n_{i+1} F_i \,\} \;\le\; 2n - n\,\frac{F_i}{F_{i+3}} \;<\; n\left(4 - \sqrt{5} + \varphi^{-2i}\right),$$

where $\varphi = \frac{1+\sqrt{5}}{2}$. Hence,

$$M[\mathrm{MinMax+}] \;\le\; \frac{4-\sqrt{5}}{2}\, n \log_{\varphi}(n) + O(n) \;<\; 1.271\, n \log n + O(n). \qquad (3.27)$$
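The numerical constant in (3.27) can be reproduced directly. The following sketch only evaluates the expressions appearing above; the function name `minmax_plus_constant` is hypothetical, and base-2 logarithms are assumed for "log n".

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def minmax_plus_constant():
    """Constant c such that ((4 - sqrt 5)/2) * log_phi(n) = c * log2(n)."""
    return (4 - math.sqrt(5)) / 2 / math.log2(PHI)

def fib_ratio_limit():
    """Limit of F_i / F_{i+3} as i grows, i.e. phi^-3 = sqrt(5) - 2."""
    return PHI ** -3

if __name__ == "__main__":
    print(round(minmax_plus_constant(), 4))  # slightly below 1.271
    print(round(fib_ratio_limit(), 4))       # matches sqrt(5) - 2
```

The second function shows where the term 4 − √5 comes from: 2n − n·F_i/F_{i+3} tends to n(2 − (√5 − 2)) = n(4 − √5) per pair of stages.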

Thus, protocol MinMax+ is the most efficient protocol we have seen so far, with respect to the worst case.

3.3.8 Limits to Improvements

Throughout the previous sections, we have reduced the message costs further and further using new tools or combining existing ones. A natural question is how far we can go. Considering that the improvements have only been in the multiplicative constant of the n log n factor, the next question becomes: Is there a tool or a technique that would allow us to reduce the message costs for election significantly, for example, from O(n log n) to O(n)? These types of questions are all part of a larger and deeper one: What is the message complexity of election in a ring? To answer this question, we need to establish a lower bound, a limit that no election protocol can improve upon, regardless of the amount and cleverness of the design effort. In this section we will see different bounds, some for unidirectional rings and others for bidirectional ones, depending on the amount of a priori knowledge the


entities have about the ring. As we will see, in all cases, the lower bounds are all of the form Ω(n log n). Thus, any further improvement can only be in the multiplicative constant.

Unidirectional Rings

We want to know what is the number of messages that any election algorithm for unidirectional rings must transmit in the worst case. A subtler question is to determine the number of messages that any solution algorithm must transmit on the average; clearly, a lower bound on the average case is also a lower bound on the worst case.

We will establish a lower bound under the standard assumptions of Connectivity and Total Reliability, plus Initial Distinct Values (required for election), and obviously Ring. We will actually establish the bound assuming that there is Message Ordering; this implies that in systems without Message Ordering, the bound is at least as bad. The lower bound will be established for minimum-finding protocols; because of the Initial Distinct Values restriction, every minimum-finding protocol is also an election protocol. Also, we know that with an additional n messages, every election protocol becomes a minimum-finding protocol.

When a minimum-finding algorithm is executed in a ring of entities with distinct values, the total number of transmitted messages depends on two factors: communication delays and the assignment of initial values.

Consider the unidirectional ring $R = (x_0, x_1, \ldots, x_{n-1})$; let $s_i = id(x_i)$ be the unique value assigned to $x_i$. The sequence $s = \langle s_0, s_1, \ldots, s_{n-1} \rangle$, thus, describes the assignment of ids to the entities. Denote by S the set of all such assignments. Given a ring R of size n and an assignment s ∈ S of n ids, we will say that R is labeled by s, and denote it by R(s).

Let A be a minimum-finding protocol under the restrictions stated above. Consider the executions of A started simultaneously by all entities and their cost.
The average and the worst-case costs of these executions are possibly better but surely not worse than the average and the worst-case costs, respectively, over all possible executions; thus, if we find them, they will give us a lower bound.

Call global state of an entity x at time t the content of all its local registers and variables at time t. As we know, the entities are event driven. This means that for a fixed set of rules A, their next global state will depend solely on the current one and on what event has occurred. In our case, once the execution of A is started, the only external events are the arrival of messages.

During an action, an entity might send one or more messages to its only out-neighbor; if it is more than one, we can "bundle" them together as they are all sent within the same action (i.e., before any new message is received). Thus, we assume that in A, only one message is sent in the execution of an action by an entity.

Associate to each message all the "history" of that message. That is, with each message M, we associate a sequence of values, called trace, as follows: (1) If the sender has id $s_i$ and has not previously received any message, the trace will be just $\langle s_i \rangle$. (2) If the sender has id $s_i$ and its last message previously received has trace $\langle l_1, \ldots, l_{k-1} \rangle$, k > 1, the trace will be $\langle l_1, \ldots, l_{k-1}, s_i \rangle$, which has length k.

Thus, a message M with trace $\langle s_i, s_{i+1}, \ldots, s_{i+k} \rangle$ indicates that a message was originally sent by entity $x_i$; as a reaction, the neighbor $x_{i+1}$ sent a message; as a reaction, the neighbor $x_{i+2}$ sent a message; ...; as a reaction, $x_{i+k}$ sent the current message M.

¹ The converse is not true.

IMPORTANT. Note that because of our two assumptions (simultaneous start by all entities and only one message per action), messages are uniquely described by their associated trace.

We will denote by ab the concatenation of two sequences a and b. If d = abc, then a, b, and c are called subsequences of d; in particular, each of a, ab, and abc will be called a prefix of d; each of c, bc, and abc will be called a suffix of d. Given a sequence a, we will denote by len(a) the length of a and by C(a) the set of cyclic permutations of a; clearly, |C(a)| = len(a).

Example. If d = ⟨2, 15, 9, 27⟩, then len(d) = 4; the subsequences ⟨2⟩, ⟨2, 15⟩, ⟨2, 15, 9⟩, and ⟨2, 15, 9, 27⟩ are prefixes; the subsequences ⟨27⟩, ⟨9, 27⟩, ⟨15, 9, 27⟩, and ⟨2, 15, 9, 27⟩ are suffixes; and C(d) = {⟨2, 15, 9, 27⟩, ⟨15, 9, 27, 2⟩, ⟨9, 27, 2, 15⟩, ⟨27, 2, 15, 9⟩}.

The key point to understand is the following: If in two different rings, for example, in R(a) and in R(b), an entity executing A happens to have the same global state, and it receives the same message, then it will perform the same action in both cases, and the next global state will be the same in both executions. Recall Property 1.6.1. Let us use this point.

Lemma 3.3.1 Let a and b both contain c as a subsequence. If a message with trace c is sent in an execution of A on R(a), then a message with trace c is sent in an execution of A on R(b).

Proof. Assume that a message with trace $c = \langle s_i, \ldots, s_{i+k} \rangle$ is sent when executing A on R(a). This means that when entity $x_i$ started the trace, it had not received any other message, and so, the transmission of this message was part of its initial "spontaneous" action; as the nature of this action depends only on A, $x_i$ will send the message both in R(a) and in R(b). This message was the first and only message $x_{i+1}$ received from $x_i$ both in R(a) and in R(b); in other words, its global state until it received the message with trace starting with $s_i$ was the same in both rings; hence, it will send the same message with trace $\langle s_i, s_{i+1} \rangle$ to $x_{i+2}$ in both situations. In general, between the start of the algorithm and the arrival of a message with trace $\langle s_i, \ldots, s_{j-1} \rangle$, entity $x_j$ with id $s_j$, $i < j \le i + k$, is in the same global state and sends and receives the same messages in both R(a) and R(b); thus, it will send a message with trace $\langle s_i, \ldots, s_{j-1}, s_j \rangle$ regardless of whether the input sequence is a or b. Thus, if an execution of A in R(a) has a message with trace c, then there is an execution of A in R(b) that has a message with trace c. ∎
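The sequence notions used in the lower-bound argument (prefix, suffix, cyclic permutation) are easy to make concrete. The helpers below are a minimal sketch with hypothetical names; sequences are represented as Python tuples, and the data are those of the example with d = ⟨2, 15, 9, 27⟩.

```python
def prefixes(d):
    """All nonempty prefixes of d, shortest first."""
    return [tuple(d[:k]) for k in range(1, len(d) + 1)]

def suffixes(d):
    """All nonempty suffixes of d, shortest first."""
    return [tuple(d[k:]) for k in range(len(d) - 1, -1, -1)]

def cyclic_permutations(d):
    """The set C(d) of cyclic permutations of d, listed by rotation offset."""
    d = tuple(d)
    return [d[k:] + d[:k] for k in range(len(d))]

if __name__ == "__main__":
    d = (2, 15, 9, 27)
    print(prefixes(d))
    print(suffixes(d))
    print(cyclic_permutations(d))
```

When the values are distinct, as they are for ids, the len(d) rotations are all different, which is the fact |C(a)| = len(a) used in the text.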


In other words, if R(a) and R(b) have a common segment c (i.e., a consecutive group of len(c) entities in R(a) has the same ids as a consecutive group of entities in R(b)), the entity at the end of the segment cannot distinguish between the two rings when it sends the message with trace c.

As different assignments of values to rings may lead to different results (i.e., different minimum values), the protocol A must allow the entities to distinguish between those assignments. As we will see, this will be the reason Ω(n log n) messages are needed. To prove it, we will consider a set of assignments on rings that makes distinguishing among them "expensive" for the algorithm.

A set E ⊆ S of assignments of values is called exhaustive if it has the following two properties:

1. Prefix Property: For every sequence belonging to E, its nonempty prefixes also belong to E, that is, if ab ∈ E and len(a) ≥ 1, then a ∈ E.
2. Cyclic Permutation Property: Whether or not an assignment of values s belongs to E, at least one of its cyclic permutations belongs to E, that is, if s ∈ S, then C(s) ∩ E ≠ ∅.

Lemma 3.3.2 A has an exhaustive set E(A) ⊆ S.

Proof. Define E(A) to be the set of all the arrangements s ∈ S such that a message with trace s is sent in the execution of A in R(s). To prove that this set is exhaustive, we need to show that the cyclic permutation property and the prefix property hold.

To show that the prefix property is satisfied, choose an arbitrary s = ab ∈ E(A) with len(a) ≥ 1; by definition of E(A), there will be a message with trace ab when executing A in R(ab); this means that in R(ab) there will also be a message with trace a. Consider now the (smaller) ring R(a); as a is a subsequence of both ab and (obviously) a, and there was a message with that trace in R(ab), by Lemma 3.3.1 there will be a message with trace a also in R(a); but this means that a ∈ E(A). In other words, the prefix property holds.

To show that the cyclic permutation property is satisfied, choose an arbitrary $s = \langle s_1, \ldots, s_k \rangle \in S$ and consider R(s). At least one entity must receive a message with a trace t of length k, otherwise the minimum value could not have been determined; then t is a cyclic permutation of s. Furthermore, as t is a trace in R(t), t ∈ E(A). Summarizing, t ∈ E(A) ∩ C(s). In other words, the cyclic permutation property holds. ∎

Now we are going to measure how expensive it is for the algorithm A to distinguish between the elements of E(A). Let m(s, E) be the number of sequences in E ⊆ S that are prefixes of some cyclic permutation of s ∈ S, and let $m_k(s, E)$ denote the number of those that are of length k ≥ 1.

Lemma 3.3.3 The execution of A in R(s) costs at least m(s, E(A)) messages.


Proof. Let t ∈ E(A) be the prefix of some r ∈ C(s). That is, a message with trace t is sent in R(t) and, because of Lemma 3.3.1, a message with trace t is sent also in R(r); as r ∈ C(s), a message with trace t is sent also in R(s). That is, for each prefix t ∈ E(A) of a cyclic permutation of s, there will be a message sent with trace t. The number of such prefixes t is by definition m(s, E(A)). ∎

Let $I = \{s_1, s_2, \ldots, s_n\}$ be the set of ids, and Perm(I) be the set of permutations of I. Assuming that all n! permutations in Perm(I) are equally likely, the average number $ave_A(I)$ of messages sent by A in the rings labeled by I will be the average message cost of A among the rings R(s), where s ∈ Perm(I). By Lemma 3.3.3, this means the following:

$$ave_A(I) \;\ge\; \frac{1}{n!} \sum_{s \in Perm(I)} m(s, E(A)).$$

By definition of $m_k(s, E(A))$, we have

$$ave_A(I) \;\ge\; \frac{1}{n!} \sum_{s \in Perm(I)} \sum_{k=1}^{n} m_k(s, E(A)) \;=\; \frac{1}{n!} \sum_{k=1}^{n} \sum_{s \in Perm(I)} m_k(s, E(A)).$$

We need to determine what $\sum_{s \in Perm(I)} m_k(s, E(A))$ is. Fix k and s ∈ Perm(I). Each cyclic permutation of s has only one prefix of length k. In total, there are n prefixes of length k among all the cyclic permutations of s. As there are n! elements in Perm(I), there are n!·n instances of such prefixes for a fixed k. These n!·n prefixes can be partitioned in groups $G^k_j$ of size k, by putting together all the cyclic permutations of the same sequence; there will be $q = \frac{n!\,n}{k}$ such groups. As E(A) is exhaustive, by the cyclic permutation property, the set E(A) intersects each group, that is, $|E(A) \cap G^k_j| \ge 1$. Hence,

$$\sum_{s \in Perm(I)} m_k(s, E(A)) \;\ge\; \sum_{j=1}^{q} |E(A) \cap G^k_j| \;\ge\; \frac{n!\,n}{k}.$$

Thus,

$$ave_A(I) \;\ge\; \frac{1}{n!} \sum_{k=1}^{n} \frac{n!\,n}{k} \;=\; n \sum_{k=1}^{n} \frac{1}{k} \;=\; n H_n,$$

where $H_n$ is the nth harmonic number. This lower bound on the average case is also a lower bound on the number $worst_A(I)$ of messages sent by A in the worst case in the rings labeled by I:

$$worst_A(I) \;\ge\; ave_A(I) \;\ge\; n H_n \;\approx\; 0.69\, n \log n + O(n). \qquad (3.28)$$
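The counting argument can be replayed by brute force on a small id set. The sketch below is an illustration only: the set E it uses (all sequences whose first element is their minimum) is a hypothetical exhaustive set chosen because both properties are easy to verify for it; it is not claimed to be E(A) for any particular protocol. All function names are assumptions.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def in_E(t):
    """Membership in the illustrative set E: the first element of t is its minimum."""
    return t[0] == min(t)

def rotations(s):
    """Cyclic permutations C(s) of the tuple s."""
    return [s[k:] + s[:k] for k in range(len(s))]

def m(s, member=in_E):
    """m(s, E): number of sequences in E that are prefixes of a cyclic permutation of s."""
    found = set()
    for r in rotations(tuple(s)):
        for k in range(1, len(r) + 1):
            if member(r[:k]):
                found.add(r[:k])
    return len(found)

def average_m(ids):
    """Average of m(s, E) over all permutations s of ids, as an exact fraction."""
    total = sum(m(s) for s in permutations(ids))
    return Fraction(total, factorial(len(ids)))

def n_harmonic(n):
    """The quantity n * H_n as an exact fraction."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

if __name__ == "__main__":
    ids = (1, 2, 3, 4)
    print(average_m(ids), n_harmonic(len(ids)))
```

For this particular E the average equals n·H_n exactly, so the inequality in the derivation cannot be improved by a sharper counting of exhaustive sets in general.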

This result states that Ω(n log n) messages are needed in the worst case by any solution protocol (the bound is true for every A), even if there is Message Ordering. Thus, any improvement we can hope to obtain by clever design will at most reduce the constant; in any case, the constant cannot be smaller than 0.69. Also, we cannot expect


to design election protocols that might have a bad worst case but cost dramatically less on the average. In fact, Ω(n log n) messages are needed on the average by any protocol. Notice that the lower bound we have established can be achieved. In fact, protocol AsFar requires on the average $n H_n$ messages (Theorem 3.3.1). In other words, protocol AsFar is optimal on the average.

If the entities know n, it might be possible to develop better protocols exploiting this knowledge. In fact, the lower bound in this case leaves a little more room but again the improvement can only be in the constant (Exercise 3.10.45):

$$worst_A(I \mid n \text{ known}) \;\ge\; ave_A(I \mid n \text{ known}) \;\ge\; \left(\tfrac{1}{4} - \varepsilon\right) n \log n. \qquad (3.29)$$

So far no better protocol is known.

Bidirectional Rings

In bidirectional rings, the lower bound is slightly different in both derivation and value (Exercise 3.10.46):

$$worst_A(I) \;\ge\; ave_A(I) \;\ge\; \tfrac{1}{2}\, n H_n \;\approx\; 0.345\, n \log n + O(n). \qquad (3.30)$$

Actually, we can improve this bound even if the entities know n (Exercise 3.10.47):

$$worst_A(I \mid n \text{ known}) \;\ge\; ave_A(I \mid n \text{ known}) \;\ge\; \tfrac{1}{2}\, n \log n. \qquad (3.31)$$

That is, even with the additional knowledge of n, any improvement can only be in the constant. So far, no better protocol is known.

Practical and Theoretical Implications

The lower bounds we have discussed so far indicate that Ω(n log n) messages are needed both in the worst case and on the average, regardless of whether the ring is unidirectional or bidirectional, and whether n is known or not. The only difference between these cases will be in the constant. In the previous sections, we have seen several protocols that use O(n log n) messages in the worst case (and are thus optimal); their cost provides us with upper bounds on the complexity of leader election in a ring.

If we compare the best upper and lower bounds for unidirectional rings with those for bidirectional rings, we notice the existence of a very surprising situation: The bounds for unidirectional rings are "better" than those for bidirectional ones; the upper bound is smaller and the lower bound is bigger (see Figures 3.33 and 3.34). This fact has strange implications: As far as electing a leader in a ring is concerned, unidirectional rings seem to be better systems than bidirectional ones, which in turn implies that, practically, half-duplex links are better than full-duplex links.


bidirectional    worst case               average                 notes
All the Way      n^2                      n^2                     -
AsFar            n^2                      0.69 n log n + O(n)     -
ProbAsFar        n^2                      0.49 n log n + O(n)     -
Control          6.31 n log n + O(n)      -                       -
Stages           2 n log n + O(n)         -                       -
StagesFbk        1.89 n log n + O(n)      -                       -
Alternate        1.44 n log n + O(n)      -                       oriented ring
BiMinMax         1.44 n log n + O(n)      -                       -
lower bound      -                        0.5 n log n + O(n)      n = 2^p known

FIGURE 3.33: Summary of bounds for bidirectional rings.

This is clearly counterintuitive: In terms of communication hardware, Bidirectional Links are clearly more powerful than half-duplex links. Yet the bounds are quite clear: Election protocols for unidirectional rings are more efficient than those for bidirectional ones.

A natural reaction to this strange state of affairs is to suggest the use in bidirectional rings of unidirectional protocols; after all, with Bidirectional Links we can send in both directions, "left" and "right," so we can just decide to use only one, say "right." Unfortunately, this argument is based on the hidden assumption that the bidirectional ring is also oriented, that is, "right" means the same to all processors. In other words, it assumes that the labeling of the port numbers, which is purely local, is actually globally consistent.

This explains why we cannot use the (more efficient) unidirectional protocol in a generic bidirectional ring. But why should we do better in unidirectional rings? The answer is interesting: In a unidirectional ring, there is orientation. Each entity has only one out-neighbor, so there is no ambiguity as to where to send a message. In other words, we have discovered an important principle of the nature of distributed computing: Global consistency is more important than hardware communication power.

unidirectional   worst case               average                 notes
All the Way      n^2                      n^2                     -
AsFar            n^2                      0.69 n log n + O(n)     -
UniStages        2 n log n + O(n)         -                       -
UniAlternate     1.44 n log n + O(n)      -                       -
MinMax           1.44 n log n + O(n)      -                       -
MinMax+          1.271 n log n + O(n)     -                       -
lower bound      -                        0.69 n log n + O(n)     -
lower bound      -                        0.25 n log n + O(n)     n = 2^p known

FIGURE 3.34: Summary of bounds for unidirectional rings.
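To get a feeling for how far apart the bounds for unidirectional rings are, the leading terms can be evaluated for a concrete ring size. This is a rough sketch only: it ignores the O(n) terms, assumes base-2 logarithms, and uses hypothetical names (`UNI_WORST`, `UNI_LOWER`, `leading_term`); the constants are those listed in Figure 3.34.

```python
import math

# Leading constants of the worst-case n log n bounds for unidirectional rings.
UNI_WORST = {"UniStages": 2.0, "UniAlternate": 1.44, "MinMax": 1.44, "MinMax+": 1.271}
UNI_LOWER = 0.69  # lower bound constant when n is not known

def leading_term(constant, n):
    """Value of constant * n * log2(n), with the O(n) term dropped."""
    return constant * n * math.log2(n)

if __name__ == "__main__":
    n = 1024
    for name, c in UNI_WORST.items():
        print(f"{name:13s} {leading_term(c, n):10.0f}")
    print(f"{'lower bound':13s} {leading_term(UNI_LOWER, n):10.0f}")
```

Even for the best protocol in the table, the ratio between upper and lower constants is about 1.271 / 0.69 ≈ 1.84, so the remaining gap is less than a factor of two.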


This principle is quite general. In the case of rings, the difference is not much, just in the multiplicative constant. As we will see in other topologies, this difference can actually be dramatic.

If the ring is both bidirectional and oriented, then we can clearly use any unidirectional protocol as well as any bidirectional one. The important question is whether in this case we can do better than that. That is, the quest is for a protocol for bidirectional oriented rings that

1. fully exploits the power of both full-duplex links and orientation;
2. cannot be used or simulated in unidirectional rings, nor in general bidirectional ones; and
3. is more efficient than any unidirectional protocol or general bidirectional one.

We have seen a protocol for oriented rings, Alternate; however, it can be simulated in unidirectional rings (protocol UniAlternate). To date, no protocol with such properties is known. It is not even known whether it can exist (Problem 3.10.7).

3.3.9 Summary and Lessons

We have examined the design of several protocols for leader election in ring networks and analyzed the effects that design decisions have had on the costs. When developing the election protocols, we have introduced some key strategies that are quite general in nature and, thus, can be used for different problems and for different networks. Among them are the idea of electoral stages and the concept of controlled distances. We have also employed ideas and tools, for example, feedback and notification, already developed for other problems.

In terms of costs, we have seen that Θ(n log n) messages will be used both in the worst case and on the average, regardless of whether the ring is unidirectional or bidirectional, oriented or unoriented, and whether n is known or not. The only difference is in the multiplicative constant. The bounds are summarized in Figures 3.33 and 3.34.
As a consequence of these bounds, we have seen that orientation of the ring is, so far, more powerful than the presence of Bidirectional Links.

Both ring networks and tree networks have very sparse topologies: m = n − 1 in trees and m = n in rings. In particular, if we remove any single link from a ring, we obtain a tree. Still, electing a leader costs Θ(n log n) in rings but only Θ(n) in trees. The reason for such a drastic complexity difference has to be found not in the number of links but instead in the properties of the topological structure of the two types of networks. In a tree, there is a high level of asymmetry: We have two types of nodes, internal nodes and leaves; it is by exploiting such asymmetry that election can be performed in a linear number of messages. On the contrary, a ring is a highly symmetrical structure, where every node is indistinguishable from another.

Consider that the election task is really a task of breaking symmetry: We want one entity to become different from all others. The entities already have a behavioral symmetry: They all have the same set of rules and the same initial state, and potentially they


are all initiators. Thus, the structural symmetry of the ring topology only makes the solution to the problem more difficult and more expensive. This observation reflects a more general principle: As far as election is concerned, structural asymmetry is to the protocol designer's advantage; on the contrary, the presence of structural symmetry is an obstacle for the protocol designer.

3.4 ELECTION IN MESH NETWORKS

Mesh networks constitute a large class of architectures that includes meshes and tori; this class is popular especially for parallel systems, redundant memory systems, and interconnection networks. These networks, like trees and rings, are sparse: m = O(n). Using our experience with trees and rings, we will now approach the election problem in such networks. Unless otherwise stated, we will consider Bidirectional Links.

3.4.1 Meshes

A mesh M of dimensions a × b has n = a × b nodes, $x_{i,j}$, 1 ≤ i ≤ a, 1 ≤ j ≤ b. Each node $x_{i,j}$ is connected to $x_{i-1,j}$, $x_{i,j-1}$, $x_{i+1,j}$, $x_{i,j+1}$ if they exist; let us stress that these names are used for descriptive purposes only and are not known to the entities. The total number of links is thus m = a(b − 1) + b(a − 1) = 2ab − a − b (see Figure 3.35).

Observe that in a mesh, we have three types of nodes: corner (entities with only two neighbors), border (entities with three neighbors), and interior (with four neighbors) nodes. In particular, there are four corner nodes, 2(a + b − 4) border nodes, and n − 2(a + b − 2) interior nodes.

Unoriented Mesh

The asymmetry of the mesh can be exploited to our advantage when electing a leader: As it does not matter which entity becomes leader, we can elect one of the four corner nodes. In this way, the problem of choosing a leader among (possibly) n nodes is reduced to the problem of choosing a leader among the

FIGURE 3.35: Mesh of dimension 4 × 5.


four corner nodes. Recall that any number of nodes can start (each unaware of when and where the others will start, if at all); thus, to achieve our goal, we need to design a protocol that first of all makes the corners aware of the election process (they might not be initiators at all) and then performs the election among them.

The first step, to make the corners aware, can be performed doing a wake-up of all entities. When an entity wakes up (spontaneously if it is an initiator, upon receiving a wake-up message otherwise), its subsequent actions will depend on whether it is a corner, a border, or an interior node. In particular, the four corners will become awake and can start the actual election process.

Observe the following interesting property of a mesh: If we consider only the border and corner nodes and the links between them, they form a ring network. We can, thus, elect a leader among the corners by using an election protocol for rings: The corners will be the only candidates; the borders will act as relayers (defeated nodes). When one of the corner nodes is elected, it will notify all other entities of termination. Summarizing, the process will consist of:

1. wake-up, started by the initiators;
2. election (on the outer ring), among the corners;
3. notification (i.e., broadcast), started by the leader.

Let us consider these three activities individually.

(1) Wake-up is straightforward. Each of the k initiators will send a wake-up to all its neighbors; a noninitiator will receive the wake-up message from a neighbor and forward it to all its other neighbors (no more than three); hence the number of messages (Exercise 3.10.48) will be no more than 3n + k.

(2) The election on the outer ring requires a little more attention. First of all, we must choose which ring protocol we will use; clearly, the selection is among the efficient ones we have discussed at great length in the preceding sections.
Then we must ensure that the messages of the ring election protocol are correctly forwarded along the links of the outer ring. Let us use protocol Stages and consider the first stage. According to the protocol, each candidate (in our case, a corner node) sends a message containing its value in both directions in the ring; each defeated entity (in our case, a border node) will forward the message along the (outer) ring. Thus, in the mesh, each corner node will send a message to its only two neighbors. A border node y, however, has three neighbors, of which only two are in the outer ring; when y receives the message, it does not know to which of the other two ports it must forward the message. The solution is simple: as y does not know to which port the message must be sent, it will forward it to both. One copy will travel along the ring and proceed safely, and the other will instead reach an interior node z; when the


interior node z receives such an election message, it will reply to the border node y “I am in the interior,” so no subsequent election messages are sent to it. Actually, it is possible to avoid those replies without affecting the correctness (Exercise 3.10.50).

In Stages, the number of candidates is at least halved every time. This means that after the second stage, one of the corners will determine that it has the smallest id among the four candidates and will become leader. Each stage requires 2n messages, where n here stands for the size 2(a + b − 2) of the outer ring. An additional 2(a + b − 4) messages are unknowingly sent by the border to the interior in the first stage; there are also the 2(a + b − 4) replies from those interior nodes, which, however, can be avoided (Exercise 3.10.50). Hence, the number of messages for the election process will be at most 4(a + b − 2) + 2(a + b − 4) = 6(a + b) − 16.

IMPORTANT. Notice that in a square mesh (i.e., a = b), this means that the election process proper can be achieved in O(√n) messages.

(3) Broadcasting the notification can be performed using Flood, which will require less than 3n messages as it is started by a corner. Actually, with care, we can ensure that less than 2n messages are sent in total (Exercise 3.10.49).

Thus in total, the protocol ElectMesh we have designed will have cost 6(a + b) + 5n + k − 16. With a simple modification to the protocol, it is possible to save an additional 2(a + b − 4) messages (Exercise 3.10.51), achieving a cost of at most

M[ElectMesh] ≤ 4(a + b) + 5n + k − 32. (3.32)
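The observation that corners and borders together form a ring can be checked with a small sketch. The function name `classify` and the (row, column) coordinates are illustrative assumptions, not part of protocol ElectMesh; a node's role is derived from its degree alone, which is all an entity in the mesh can observe locally. The sketch assumes a, b ≥ 2.

```python
def classify(a, b):
    """Classify nodes of an a x b mesh by degree: 2 = corner, 3 = border, 4 = interior."""
    kinds = {}
    for i in range(a):
        for j in range(b):
            # Count the neighbors that actually exist inside the mesh.
            degree = sum(
                0 <= i + di < a and 0 <= j + dj < b
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
            )
            kinds[(i, j)] = {2: "corner", 3: "border", 4: "interior"}[degree]
    return kinds


kinds = classify(4, 5)
corners = [v for v, kind in kinds.items() if kind == "corner"]
borders = [v for v, kind in kinds.items() if kind == "border"]
outer_ring = corners + borders
```

For the 4 × 5 mesh of Figure 3.35, the outer ring has 2(a + b − 2) = 14 nodes, of which 2(a + b − 4) = 10 are border nodes, one for each message that leaks toward the interior in the first stage.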

NOTE. The most expensive operation is to wake up the nodes.

Oriented Mesh

A mesh is called oriented if the port numbers are the traditional compass labels (north, south, east, west) assigned in a globally consistent way. This assignment of labels has many important properties, in particular, one called sense of direction that can be exploited to obtain efficient solutions to problems such as broadcast and traversal (Problems 3.10.52 and 3.10.53). For the purposes of election, in an oriented mesh, it is trivial to agree on a unique node. For example, there is only one corner with link labels “south” and “west.” Thus, to elect a leader in an oriented mesh, we must just ensure that this unique node knows that it must become leader. In other words, the only part needed is a wake-up: Upon becoming awake, and participating in the wake-up process, an entity can immediately become leader or follower depending on whether or not it is the southwest corner.


Notice that in an oriented mesh, we can exploit the structure of the mesh and the orientation to perform a wake-up with fewer than 2n messages (Problem 3.10.54).

Complexity

These results mean that regardless of whether the mesh is oriented or not, a leader can be elected with O(n) messages, the difference being solely in the multiplicative constant. As no election protocol for any topology can use fewer than n messages, we have

Lemma 3.4.1 M(Elect/IR; Mesh) = Θ(n)

3.4.2 Tori

Informally, the torus is a mesh with “wrap-around” links that transform it into a regular graph: Every node has exactly four neighbors. A torus of dimensions a × b has n = ab nodes v_{i,j} (0 ≤ i ≤ a − 1, 0 ≤ j ≤ b − 1); each node v_{i,j} is connected to four nodes v_{i,j+1}, v_{i,j−1}, v_{i+1,j}, and v_{i−1,j}, where all the operations on the first index are modulo a, while those on the second index are modulo b (e.g., see Figure 3.36). In the following sections, we will focus on square tori (i.e., where a = b).

FIGURE 3.36: Torus of dimension 4 × 5.

Oriented Torus

We will first develop an election protocol assuming that there is the compass labeling (i.e., the links are consistently labeled as north, south, east, and west, and the dimensions are known); we will then see how to solve the problem also when the labels are arbitrary. A torus with such a labeling is said to be oriented. In designing the election protocol, we will use the idea of electoral stages developed originally for ring networks and also use the defeated nodes in an active way. We will also employ a new idea, marking of territory.

FIGURE 3.37: Marking the territory. If the territories of two candidates intersect, one of them will see the marking of the other.

(I) In stage i, each candidate x must “mark” the boundary of a territory T_i (a d_i × d_i region of the torus), where d_i = α^i for some fixed constant α > 1; initially

the territory is just the single candidate node. The marking is done by originating a “Marking” message (with x’s value) that will travel to distance d_i first north, then east, then south, and finally west to return to x (distances include the starting node). A very important fact is that if the territories of two candidates have some elements in common, the “Marking” message of at least one of them will encounter the marking of the other (Figure 3.37).

(II) If the “Marking” message of x does not encounter any other marking of the same stage, x survives this stage, enters stage i + 1, and starts the marking of a larger territory T_{i+1}.

(III) If the “Marking” message arrives at a node w already marked by another candidate y in the same stage, the following will occur:
1. If y has a larger id, the “Marking” message will continue to mark the boundary, setting a boolean variable SawLarger to true.
2. If the id of y is instead smaller, then w will terminate the “Marking” message from x; it will then originate a message “SeenbyLarger(x, i)” that will travel along the boundary of y’s territory.

If candidate x receives both its “Marking” message with SawLarger = true and a “SeenbyLarger” message, x survives this stage, enters stage i + 1, and starts the marking of a larger territory T_{i+1}.

Summarizing, for a candidate x to survive, it is necessary that it receives its “Marking” message back. If SawLarger = false, then that suffices; if SawLarger = true, x must also receive a “SeenbyLarger” message. Note that if x receives a “SeenbyLarger(z, i)” message, then z did not finish marking its boundary; thus z does not survive this stage. In other words, if x survives, either its message found no other markings, or at least one other candidate does not survive.
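The route of a “Marking” message can be illustrated with a minimal sketch. It assumes a square oriented torus whose nodes are addressed by (row, column), with north decreasing the row and east increasing the column; these coordinates and the name `mark_boundary` are illustrative choices, not part of the protocol. Because distances include the starting node, each side of the territory takes d − 1 hops.

```python
def mark_boundary(start, d, size):
    """Nodes visited by a Marking message around a d x d territory in a size x size oriented torus."""
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # north, east, south, west
    r, c = start
    path = [start]
    for dr, dc in moves:
        # A side spans d nodes including its first one, hence d - 1 hops.
        for _ in range(d - 1):
            r, c = (r + dr) % size, (c + dc) % size
            path.append((r, c))
    return path


path_x = mark_boundary((5, 5), 3, 8)
path_y = mark_boundary((6, 6), 3, 8)
shared = set(path_x) & set(path_y)
```

The message returns to its originator after 4(d − 1) hops, and the two overlapping territories in this example share boundary nodes, which is where one candidate's message would meet the other's marking.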


(IV) A relay node w might receive several “Marking” messages from different candidates in the same stage. It will only be part of the boundary of the territory of the candidate with the smallest id.

This means that if w was part of the boundary of some candidate x and now becomes part of the boundary of y, a subsequent “SeenbyLarger” message intended for x will be sent along the boundary of y. This is necessary for correctness. To keep the number of messages small, we will also limit the number of “SeenbyLarger” messages sent by a relayer.

(V) A relay node will only forward one “SeenbyLarger” message.

The algorithm continues in this way until d_i ≥ √n. In this case, a candidate will receive its “Marking” message from south instead of east because of the “wrap-around” in the torus; it then sends the message directly east, and will wait for it to arrive from west.

(VI) When a wrap-around is detected (the candidate receives its “Marking” message from south rather than from east), a candidate x sends the message directly east, and waits for it to arrive from west.

If it survives, in all subsequent stages the marking becomes simpler.

(VII) In every stage after wrap-around, a candidate x sends its “Marking” message first north and waits to receive it from south, then it sends it east, and waits for it to arrive from west.

The situation where there is only one candidate left will for sure be reached after a constant number p of stages after the wrap-around occurs, as we will see later.

(VIII) If a candidate x survives p stages after wrap-around, it will become leader and notify all other entities of termination.

Let us now discuss the correctness and cost of the algorithm, protocol MarkBoundary, we have just described.

Correctness and Cost

For the correctness, we need to show progress, that is, at least one candidate survives each stage of the algorithm, and termination, that is, p stages after wrap-around there will be only one candidate left. Let us discuss progress first.
A candidate whose “Marking” message does not encounter any other boundary will survive this stage; so the only problem would be if, in a stage, every “Marking” message encounters another candidate’s boundary, and somehow none of them advances. We must show that this cannot happen. In fact, if every “Marking” message encounters another candidate’s boundary, the one with the largest id will encounter a smaller id; the candidate with this smaller id will go onto the next stage unless its message encounters the boundary with an even smaller id, and so on; however, the message of the candidate with the smallest id cannot encounter a larger id (because it is the smallest) and, thus, that entity would survive this stage. For termination, the number of candidates does decrease overall, but not in a simple way. However, it is possible to bound the maximum number of candidates


in each stage, and that bound strictly decreases. Let n_i be the maximum number of candidates in stage i. Up until wrap-around, there are two types of survivors: (a) those entities whose message did not encounter any border and (b) those whose message encountered a border with a larger id and whose border was encountered by a message with a larger id. Let a_i denote the number of the first type of survivors; clearly a_i ≤ n/d_i^2. The number of the second type will be at most (n_i − a_i)/2 as each defeated one can cause at most one candidate to survive. Thus,

\[ n_{i+1} \le a_i + \frac{n_i - a_i}{2} = \frac{n_i + a_i}{2} \le \frac{1}{2}\left(n_i + \frac{n}{d_i^2}\right). \]

As d_i = α^i is increasing each stage, the upper bound n_i on the number of candidates is decreasing. Solving the recurrence relation gives

\[ n_{i+1} \le \frac{n}{\alpha^{2i}\,(2 - \alpha^2)}. \]  (3.33)
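As a sanity check on the closed form, the following sketch iterates the recurrence numerically and compares it with the bound n/(α^{2i}(2 − α^2)). The starting value n_0 = n (every node a candidate) and the sample values of n and α are assumptions made only for this illustration.

```python
def candidate_bounds(n, alpha, stages):
    """Iterate n_{i+1} = (n_i + n / alpha^(2i)) / 2, starting from n_0 = n."""
    bounds = [float(n)]
    for i in range(stages):
        bounds.append((bounds[-1] + n / alpha ** (2 * i)) / 2)
    return bounds


n, alpha = 10_000, 1.1795
bounds = candidate_bounds(n, alpha, 20)
# closed_form[i] is the claimed upper bound on n_{i+1}.
closed_form = [n / (alpha ** (2 * i) * (2 - alpha ** 2)) for i in range(20)]
holds = all(bounds[i + 1] <= closed_form[i] for i in range(20))
```

The bound is meaningful only for 1 < α < √2, since the factor 2 − α^2 must stay positive.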

Wrap-around occurs when α^i ≥ √n; in that stage, only one candidate can complete the marking of its boundary without encountering any markings and at most half the remaining candidates will survive. So, the number of candidates surviving this stage is at most (2 − α^2)^{−1}. In all subsequent stages, again only one candidate can complete the marking without encountering any markings and at most half the remaining candidates will survive. Hence, after p > log (2 − α^2)^{−1} additional stages for sure there will be only one candidate left. Thus, the protocol correctly terminates.

To determine the total number of messages, consider that in stage i before wrap-around, each candidate causes at most 4d_i “Marking” messages to mark its boundary and another 4d_i “SeenbyLarger” messages, for a total of 8d_i = 8α^i messages; as the number of candidates is at most as expressed by equation 3.33, the total number of messages in the pre-wrap-around stages will be at most O(nα^2/((2 − α^2)(α − 1))). In each phase after wrap-around, there is only a constant number of candidates, each sending O(√n) messages. As the number of such phases is constant, the total number of messages sent after wrap-around is O(√n). Choosing α ≈ 1.1795 yields the desired bound

M[MarkBorder] = Θ(n). (3.34)
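One plausible reading of the choice α ≈ 1.1795 is that it minimizes the α-dependent factor of the pre-wrap-around cost. The sketch below assumes that this factor is exactly α^2/((2 − α^2)(α − 1)), ignoring hidden constants and rounding, and searches for its minimum on a grid; the function name and grid resolution are illustrative.

```python
def pre_wrap_cost_factor(alpha):
    """alpha^2 / ((2 - alpha^2)(alpha - 1)), defined for 1 < alpha < sqrt(2)."""
    return alpha ** 2 / ((2 - alpha ** 2) * (alpha - 1))


# Grid search over the admissible interval 1 < alpha < sqrt(2), step 0.0001.
grid = [1 + step / 10_000 for step in range(1, 4142)]
best_alpha = min(grid, key=pre_wrap_cost_factor)
```

Under this assumption the minimizer is the real root of α^3 + 2α − 4 = 0, which lies very close to 1.1795.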

The preceding analysis ignores the fact that α^i is not an integer: The distance to travel must be rounded up and this has to be taken into account in the analysis.


However, the effect is not large and will just affect the low-order terms of the cost (Exercise 3.10.55). The algorithm as given is not very time efficient. In fact, the ideal time can be as bad as O(n) (Exercise 3.10.56). The protocol can, however, be modified so that, without changing its message complexity, the algorithm requires no more than O(√n) time (Exercise 3.10.57). The protocol we have described is tailored for square tori. If the torus is not square but rectangular with length l and width w (l ≤ w), then the algorithm can be adapted to use Θ(n + l log l/w) messages (Exercise 3.10.58).

Unoriented Torus

The algorithm we just described solves the problem of electing a leader in an oriented torus, for example, among the buildings in Manhattan (well known for its mesh-like design), by sending a messenger along east-west streets and north-south avenues, turning at the appropriate corner. Consider now the same problem when the streets have no signs and the entities have no compass. Interestingly, the same strategy can still be used: A candidate needs to mark off a square; the orientation of the square is irrelevant. To be able to travel along a square, we just need to know how to
1. forward a message “in a straight line,” and
2. make the “appropriate turn.”
We will discuss how to achieve each, separately.

(1) Forwarding in a Straight Line. We first consider how to forward a message in the direction opposite to the one from which the message was received, without knowing the directions. Consider an entity x, with its four incident links, and let a, b, c, and d be the arbitrary port numbers associated with them (see Figure 3.38); to forward a message in a straight line, x needs to determine that a and d are opposite, and so are b and c.
FIGURE 3.38: Even without a compass, x can determine which links are opposite.

This can be easily accomplished by having each entity send its identity to each of its four neighbors, each of which will forward it to its three other neighbors; the entity will in turn acquire the identity and relative position of each entity at distance 2. As a result,


x will know the two pairs of opposite port numbers. In the example of Figure 3.38, x will receive the message originating from z via both port a and port b; it, thus, knows that a is not opposite to b. It also receives the message from y via ports a and c; thus x knows also that a is not opposite to c. Then, x can conclude that a is opposite to d. It will then locally relabel one pair of opposite ports as east, west, and the other north, south; it does not matter which pair is chosen first.

(2) Making the Appropriate Turn. As a result of the previous operation, each entity x knows two perpendicular directions, but the naming (north, south) and (east, west) might not be consistent with the one done by other entities. This can create problems when wanting to make a consistent turn. Consider a message, originated by x, which is traveling “south” (according to x’s view of the torus); continuing to travel “south” can be easily accomplished as each entity knows how to forward a message in a straight line. At some point, according to the protocol, the message must turn, say to “east” (always according to x’s view of the torus), and continue in that direction. To achieve the turn correctly, we add a simple piece of information, called handrail, to a message. The handrail is the id of the neighbor in the direction the message must turn and the name of the direction. In the example of Figure 3.38, if x is sending a message south that must then turn east, the handrail in the message will be the id of its eastern neighbor q plus the direction “east.” Because every entity knows the ids and the relative position of all the entities within distance 2, when y receives this message with the handrail from x, it can determine what x means by “east,” and thus in which direction the message must turn (when the algorithm prescribes it).
Summarizing, even without a compass, we can execute the protocol MarkBorder, by adding the preprocessing phase and including the handrail information in the messages. The cost of the preprocessing is relatively small: Each entity receives four messages from its immediate neighbors and 4 × 3 for the entities at distance 2, for a total of 16n messages.
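The rule for recognizing opposite ports can be sketched as follows: two ports are opposite exactly when no identity at distance 2 is heard through both of them. The sketch assumes a square torus of side at least 5 (so that wrap-around does not create extra coincidences), uses (row, column) pairs as ids, and shuffles the neighbor list to model the absence of a compass; all names are illustrative.

```python
import random


def torus_neighbors(node, size):
    """The four neighbors of a node in a size x size torus, in arbitrary port order."""
    r, c = node
    nbrs = [((r - 1) % size, c), ((r + 1) % size, c),
            (r, (c - 1) % size), (r, (c + 1) % size)]
    random.shuffle(nbrs)  # ports carry no compass information
    return nbrs


def opposite_ports(x, size):
    """Pair up the neighbors of x that lie on opposite ports, using only ids heard at distance 2."""
    ports = torus_neighbors(x, size)
    # heard[p]: ids relayed to x through port p (neighbors of that neighbor, except x).
    heard = [set(torus_neighbors(nb, size)) - {x} for nb in ports]
    pairs = []
    for p in range(4):
        for q in range(p + 1, 4):
            if not heard[p] & heard[q]:  # no common distance-2 node: opposite ports
                pairs.append((ports[p], ports[q]))
    return pairs


pairs = opposite_ports((2, 2), 5)
```

Perpendicular ports always share the diagonal node between them, which is the role played by z and y in the example of Figure 3.38.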

3.5 ELECTION IN CUBE NETWORKS

3.5.1 Oriented Hypercubes

The k-dimensional hypercube H_k, which we have introduced in Section 2.1.3, is a common interconnection network, consisting of n = 2^k nodes, each with degree k; hence, in H_k there are m = k 2^{k−1} = O(n log n) edges. In an oriented hypercube H_k, the port numbers 1, 2, . . . , k for the k edges incident on a node x are called dimensions and are assigned according to the “construction rules” specifying H_k (see Fig. 2.3).

We will solve the election problem in oriented hypercubes using the approach electoral stages that we have developed for ring networks. The metaphor we will use is that of a fencing tournament: in a stage of the tournament, each candidate, called duelist, will be assigned another duelist, and each pair will have a match; as a result


of the match, one duelist will be promoted to the next stage, the other excluded from further competition. In each stage, only half of the duelists enter the next stage; at the end, there will be only one duelist that will become the leader and notify the others.

Deciding the outcome of a match is easy: The duelist with the smaller id will win; for reasons that will become evident later, we will have the defeated duelist remember the shortest path to the winning duelist. The crucial and difficult parts are how pairs of opposite duelists are formed and how a duelist finds its competitor. To understand how this can be done efficiently, we need to understand some structural properties of oriented hypercubes.

A basic property of an oriented hypercube is that if we remove from H_k all the links with label greater than i (i.e., consider only the first i dimensions), we are left with 2^{k−i} disjoint oriented hypercubes of dimension i; denote the collection of these smaller cubes by H_{k:i}. For example, removing the links with labels 3 and 4 from H_4 will result in four disjoint oriented hypercubes of dimension 2 (see Figure 3.39 (a and b)). What we will do is to ensure that

(I) at the end of stage i − 1, there will be only one duelist left in each of the oriented hypercubes of dimension i − 1 of H_{k:i−1}.

So, for example, at the end of stage 2, we want to have only one duelist left in each of the four hypercubes of dimension 2 (see Figure 3.39(c)). Another nice property of oriented hypercubes is that if we add to H_{k:i−1} the links labeled i (and, thus, construct H_{k:i}) the elements of H_{k:i−1} will be grouped into pairs. We can use this property to form the pairs of duelists in each stage of the tournament:

(II) A duelist x starting stage i will have as its opponent the duelist in the hypercube of dimension i − 1 connected to x by the link labeled i.
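Properties (I) and (II) can be made concrete with a small sketch. It assumes the usual binary addressing of H_k, in which nodes are the integers 0..2^k − 1 and the link labeled i flips bit i − 1; this addressing, and the helper names below, are assumptions for illustration and are not spelled out in this section. The function `duelists_after` does not simulate the message exchange; it only computes the outcome that the tournament is meant to produce (the smallest id in each subcube).

```python
def subcube(x, i):
    """Identifier of the i-dimensional subcube of H_{k:i} that contains node x."""
    return x >> i  # dropping the i low-order bits ignores dimensions 1..i


def across(x, i):
    """Node reached from x over the link labeled i (bit i - 1 flipped)."""
    return x ^ (1 << (i - 1))


def duelists_after(k, stage, ids):
    """Node holding the smallest id in every subcube of H_{k:stage}; ids[x] is the id of node x."""
    winners = {}
    for x in range(2 ** k):
        cube = subcube(x, stage)
        if cube not in winners or ids[x] < ids[winners[cube]]:
            winners[cube] = x
    return winners


k = 4
ids = [(37 * x + 11) % 101 for x in range(2 ** k)]  # arbitrary distinct ids
survivors = duelists_after(k, 2, ids)
```

In H_4, four duelists remain after stage 2, one per two-dimensional subcube, and the link labeled 3 joins these subcubes in pairs.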
Thus, in stage i, a duelist x will send a Match message to (and receive a Match message from) the duelist y in the hypercube (of dimension i − 1) that is on the other side of link i. The Match message from x will contain the id id(x) (as well as the path traveled so far) and will be sent across dimension i (i.e., the link with label i). The entity z on the other end of the link might, however, not be the duelist y and might not even know who (and where) y is (Figure 3.40).

We need the Match message from x to reach its opponent y. We can obtain this by having z broadcast the message in its (i − 1)-dimensional hypercube (e.g., using protocol HyperFlood presented in Section 2.1.3); in this way, we are sure that y will receive the message. Obviously, this approach is an expensive one (as determined in Exercise 3.10.59). To solve this problem efficiently, we will use the following observation. If node z is not the duelist (i.e., z ≠ y), node z was defeated in a previous stage, say i_1 < i; it knows the (shortest) path to the duelist z_{i_1}, which defeated it in that stage, and can thus forward the message to it. Now, if z_{i_1} = y, then we are done: The message from x has arrived and the match can take place. Otherwise, in a similar way, z_{i_1} was


FIGURE 3.39: (a) The four-dimensional hypercube H_4, (b) the collection H_{4:2} of two-dimensional hypercubes obtained by removing the links with labels greater than 2, and (c) duelists (in black) at the end of stage 2.

FIGURE 3.40: Each duelist (in black) sends a Match message that must reach its opponent.


defeated in some subsequent stage i_2, i_1 < i_2 < i; it, thus, knows the (shortest) path to the duelist z_{i_2}, which defeated it in that stage and can thus forward the message to it. In this way, the message from x will eventually reach y; the path information in the message is updated during its travel so that y will know the dimensions traversed by the message from x to y in chronological order. The Match message from y will reach x with similar information.

The match between x and y will take place both at x and y; only one of them, say x, will enter stage i + 1, while the other, y, is defeated. From now on, if y receives a Match message, it will forward it to x; as mentioned before, we need this to be done on the shortest path. How can y (the defeated duelist) know the shortest path to x (the winner)? The Match message y received from x contained the labels of a walk to it, not necessarily the shortest path. Fortunately, it is easy to determine the shortcuts in any path using the properties of the labeling. Consider a sequence α of labels (with or without repetitions); remove from the sequence any pair of identical labels and sort the remaining ones, obtaining a compressed sequence ᾱ. For example, if α = 231345212, then ᾱ = 245. The important property is that if we start from the same node x, the walk with labels α will lead to the same node y as the walk with labels ᾱ. The other important property is that ᾱ actually corresponds to the shortest path between x and y. Thus, y needs only to compress the sequence contained in the Match message sent by x.

IMPORTANT. We can perform the compression while the message is traveling from x to y; in this way, the message will contain at most k labels.

Finally, we must consider the fact that owing to different transmission delays, it is likely that the computation in some parts of the hypercube is faster than in others.
Thus, it may happen that a duelist x in stage i sends a Match message for its opponent, but the entities on the other side of dimension i are still in earlier stages. So, it is possible that the message from x reaches a duelist y in an earlier stage j < i. What y should do with this message depends on future events that have nothing to do with the message: If y wins all matches in stages j, j + 1, . . . , i − 1, then y is the opponent of x in stage i, and it is the destination of the message; on the contrary, if it loses one of them, it must forward the message to the winner of that match. In a sense, the message from x has arrived “too soon”; so, what y will do is to delay the processing of this message until the “right” time, that is, until it enters stage i or it becomes defeated. Summarizing, 1. A duelist in stage i will send a Match message on the edge with label i. 2. When a defeated node receives a Match message, it will forward it to the winner of the match in which it was defeated. 3. When a duelist y in stage i receives a Match message from a duelist x in stage i, if id(x) > id(y), then y will enter stage i + 1, otherwise it will become defeated and compute the shortest path to x.


4. When a duelist y in stage j receives a Match message from a duelist x in stage i > j, y will enqueue the message and process it (as a newly arrived one) when it enters stage i or becomes defeated.

The protocol terminates when a duelist wins the kth stage. As we will see, when this happens, that duelist will be the only one left in the network. The algorithm, protocol HyperElect, is shown in Figures 3.41 and 3.42. NextDuelist denotes the (list of labels on the) path from a defeated node to the duelist that defeated it. The Match message contains (Id*, stage*, source*, dest*), where Id* is the identity of the duelist x originating the message; stage* is the stage of this match; source* is (the list of labels on) the path from the duelist x to the entity currently processing the message; and dest* is (the list of labels on) the path from the entity currently processing the message to a target entity (used to forward the message by the shortest path between a defeated entity and its winner). Given a list of labels list, the protocol uses the following functions:
- first(list) returns the first element of the list;
- list ⊕ i (respectively, list ⊖ i) updates the given path by adding (respectively, eliminating) a label i to the list and compressing it.

To store the delayed messages, we use a set Delayed that will be kept sorted by stage number; for convenience, we also use a set delay of the corresponding stage numbers.

Correctness and termination of the protocol derive from the following fact (Exercise 3.10.61):

Lemma 3.5.1 Let id(x) be the smallest id in one of the hypercubes of dimension i in H_{k:i}. Then x is a duelist at the beginning of stage i + 1.

This means that when i = k, there will be only one duelist left at the end of that stage; it will then become leader and notify the others so as to ensure proper termination. To determine the cost of the protocol, we need to determine the number of messages sent in a stage i.
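The compression rule and the two path operations described earlier can be sketched as follows. The sketch assumes the usual binary addressing of the hypercube (label i flips bit i − 1) to check that a walk and its compressed form end at the same node; the function names are illustrative, and `remove_label` relies on the path already being compressed.

```python
from collections import Counter


def compress(labels):
    """Drop every pair of identical labels and sort what remains."""
    counts = Counter(labels)
    return sorted(label for label, c in counts.items() if c % 2 == 1)


def walk(x, labels):
    """Follow a sequence of dimension labels from node x (label i flips bit i - 1)."""
    for label in labels:
        x ^= 1 << (label - 1)
    return x


def add_label(path, i):
    """list (+) i: append label i to the path, then compress."""
    return compress(path + [i])


def remove_label(path, i):
    """list (-) i: eliminate label i from an already compressed path."""
    return [label for label in path if label != i]


alpha = [2, 3, 1, 3, 4, 5, 2, 1, 2]
alpha_bar = compress(alpha)
```

Here `alpha_bar` is the sequence 2, 4, 5 from the example in the text: labels 1 and 3 occur an even number of times and cancel, while 2 occurs three times and survives once.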
For a defeated entity z, denote by w(z) its opponent (i.e., the one that won the match). For simplicity of notation, let w^j(z) = w(w^{j−1}(z)) where w^0(z) = z. Consider an arbitrary H ∈ H_{k:i−1}; let y be the only duelist in H in stage i and let z be the entity in H that first receives the Match message for y from its opponent. Entity z must send this message to y; it forwards the message (through the shortest path) to w(z), which will forward it to w(w(z)) = w^2(z), which will forward it to w(w^2(z)) = w^3(z), and so on, until w^t(z) = y. There will be no more than i such “forward” points (i.e., t ≤ i); as we are interested in the worst case, assume this to be the case. Thus, the total cost will be the sum of all the distances between successive forward points, plus one (from x to z). Denote by d(j − 1, j) the distance between w^{j−1}(z) and w^j(z); clearly d(j − 1, j) ≤ j (Exercise 3.10.60); then the total number of messages required for the Match message from a duelist x in stage i to reach its


PROTOCOL HyperElect.

States: S = {ASLEEP, DUELLIST, DEFEATED, FOLLOWER, LEADER}; S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ OrientedHypercube.

ASLEEP
   Spontaneously
   begin
      stage := 1; delay := ∅; value := id(x);
      Source := [stage]; Dest := [];
      send("Match", value, stage, Source, Dest) to 1;
      become DUELLIST;
   end

   Receiving("Match", value*, stage*, Source*, Dest*)
   begin
      stage := 1; value := id(x);
      Source := [stage]; Dest := [];
      send("Match", value, stage, Source, Dest) to 1;
      become DUELLIST;
      if stage* = stage then
         PROCESS MESSAGE;
      else
         DELAY MESSAGE;
      endif
   end

DUELLIST
   Receiving("Match", value*, stage*, Source*, Dest*)
   begin
      if stage* = stage then
         PROCESS MESSAGE;
      else
         DELAY MESSAGE;
      endif
   end

DEFEATED
   Receiving("Match", value*, stage*, Source*, Dest*)
   begin
      if Dest* = [] then Dest* := NextDuelist; endif
      l := first(Dest*); Dest := Dest* ⊖ l; Source := Source* ⊕ l;
      send("Match", value*, stage*, Source, Dest) to l;
   end

   Receiving("Notify")
   begin
      send("Notify") to {l ∈ N(x) : l > sender};
      become FOLLOWER;
   end

FIGURE 3.41: Protocol HyperElect.


Procedure PROCESS MESSAGE
begin
   if value* > value then
      if stage* = k then
         send("Notify") to N(x);
         become LEADER;
      else
         stage := stage + 1; Source := [stage]; Dest := [];
         send("Match", value, stage, Source, Dest) to stage;
         CHECK;
      endif
   else
      NextDuelist := Source*;
      CHECK ALL;
      become DEFEATED;
   endif
end

Procedure DELAY MESSAGE
begin
   Delayed ⇐ (value*, stage*, Source*, Dest*);
   delay ⇐ stage*;
end

Procedure CHECK
begin
   if Delayed ≠ ∅ then
      next := Min{delay};
      if next = stage then
         (value*, stage*, Source*, Dest*) ⇐ Delayed;
         delay := delay − {next};
         PROCESS MESSAGE
      endif
   endif
end

Procedure CHECK ALL
begin
   while Delayed ≠ ∅ do
      (value*, stage*, Source*, Dest*) ⇐ Delayed;
      if Dest* = [] then Dest* := NextDuelist; endif
      l := first(Dest*); Dest := Dest* ⊖ l; Source := Source* ⊕ l;
      send("Match", value*, stage*, Source, Dest) to l;
   endwhile
end

FIGURE 3.42: Procedures used by Protocol HyperElect.

opponent y will be at most

\[ L(i) = 1 + \sum_{j=1}^{i-1} d(j-1, j) = 1 + \sum_{j=1}^{i-1} j = 1 + \frac{i(i-1)}{2}. \]

Now we know how much it costs for a Match message to reach its destination. What we need to determine is how many such messages are generated in each stage;


in other words, we want to know the number n_i of duelists in stage i (as each will generate one such message). By Lemma 3.5.1, we know that at the beginning of stage i, there is only one duelist in each of the hypercubes H ∈ H_{k:i−1}; as there are exactly n/2^{i−1} = 2^{k−i+1} such cubes, n_i = 2^{k−i+1}. Thus, the total number of messages in stage i will be

\[ n_i L(i) = 2^{k-i+1}\left(1 + \frac{i(i-1)}{2}\right) \]

and over all stages, the total will be

\[ \sum_{i=1}^{k} 2^{k-i+1}\left(1 + \frac{i(i-1)}{2}\right) = 2^k \left( \sum_{i=1}^{k} \frac{1}{2^{i-1}} + \sum_{i=1}^{k} \frac{i^2}{2^i} - \sum_{i=1}^{k} \frac{i}{2^i} \right) = 6 \cdot 2^k - k^2 - 3k - 6. \]

As 2^k = n, and adding the (n − 1) messages to broadcast the termination, we have

M[HyperElect] ≤ 7n − (log n)^2 − 3 log n − 7. (3.35)
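Bound (3.35) can be checked numerically by summing n_i L(i) over all stages and adding the n − 1 notification messages. The sketch below uses only the quantities defined in the text (n_i = 2^{k−i+1} and L(i) = 1 + i(i − 1)/2); the function names are illustrative.

```python
def match_cost(i):
    """L(i) = 1 + i(i - 1)/2: worst-case hops for one Match message in stage i."""
    return 1 + i * (i - 1) // 2


def hyperelect_bound(k):
    """Sum of n_i * L(i) over stages 1..k, plus n - 1 notification messages."""
    n = 2 ** k
    matches = sum(2 ** (k - i + 1) * match_cost(i) for i in range(1, k + 1))
    return matches + (n - 1)


totals = {k: hyperelect_bound(k) for k in range(1, 16)}
```

For every k tried here, the total coincides with 7n − (log n)^2 − 3 log n − 7 where n = 2^k, so the bound is attained exactly under the worst-case assumption on the forwarding distances.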

That is, we can elect a leader in less than 7n messages! This result should be contrasted with the fact that in a ring we need Ω(n log n) messages. As for the time complexity, it is not difficult to verify that protocol HyperElect requires at most O(log^3 n) ideal time (Exercise 3.10.62).

Practical Considerations

The O(n) message cost of protocol HyperElect is achieved by having the Match messages convey path information in addition to the usual id and stage number. In particular, the fields Source and Dest have been described as lists of labels; as we only send compressed paths, Source and Dest contain at most log n labels each. So it would appear that the protocol requires “long” messages. We will now see that in practice, each list only requires log n bits (i.e., the cost of a counter).

Examine a compressed sequence of edge labels α in H_k (e.g., α = 1457 in H_8); as the sequence is compressed, there are no repetitions. The elements in the sequence are a subset of the integers between 1 and k; thus α can be represented as a binary string b_1, b_2, . . . , b_k where each bit b_j = 1 if and only if j is in α. Thus, the list α = 1457 in H_8 is uniquely represented as 10011010. Hence, each of Source and Dest will be just a variable of k = log n bits. This also implies that the cost in terms of bits of the protocol will be no more than

B[HyperElect] ≤ 7n(log id + 2 log n + log log n), (3.36)

where the log log n component is to account for the stage field.
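The bit-string encoding of a compressed path is easy to sketch. The helper names `to_bits` and `from_bits` are illustrative; the encoding itself (bit b_j set exactly when label j is in the list) is the one described in the text.

```python
def to_bits(labels, k):
    """Bit string b_1 b_2 ... b_k with b_j = '1' exactly when label j is in the compressed list."""
    return "".join("1" if j in labels else "0" for j in range(1, k + 1))


def from_bits(bits):
    """Recover the sorted list of labels from its bit-string encoding."""
    return [j for j, b in enumerate(bits, start=1) if b == "1"]


encoded = to_bits([1, 4, 5, 7], 8)
decoded = from_bits(encoded)
```

With this representation, adding or eliminating a label amounts to flipping a single bit, so compression comes for free.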


3.5.2 Unoriented Hypercubes

Hypercubes with arbitrary labellings obviously do not have the properties of oriented hypercubes. It is still possible to take advantage of the highly regular structure of hypercubes to do better than in ring networks. In fact (Problem 3.10.8),

Lemma 3.5.2 M(Elect/IR; Hypercube) ≤ O(n log log n)

To date, it is not known whether it is possible to elect a leader in a hypercube in just O(n) messages even when it is not oriented (Problem 3.10.9).

3.6 ELECTION IN COMPLETE NETWORKS

We have seen how structural properties of the network can be effectively used to overcome the additional difficulty of operating in a fully symmetric graph. For example, in oriented hypercubes, we have been able to achieve O(n) costs, that is, comparable to those obtainable in trees. In contrast, a ring has very few links and no additional structural property capable of overcoming the disadvantages of symmetry. In particular, it is so sparse (i.e., m = n) that it has the worst diameter among regular graphs (to reach the furthermost node, a message must traverse d = n/2 links) and no shortcuts. It is thus not surprising that election requires Ω(n log n) messages.

The ring is the sparsest network and it is an extreme in the spectrum of regular networks. At the other end of the spectrum lies the complete graph $K_n$; in $K_n$, each node is connected directly to every other node. It is thus the densest network, with m = n(n − 1)/2 links, and the one with the smallest diameter, d = 1. Another interesting property is that $K_n$ contains every other network G as a subgraph! Clearly, physical implementation of such a topology is very expensive. Let us examine how to exploit such very powerful features to design an efficient election protocol.

3.6.1 Stages and Territory

To develop an efficient protocol for election in complete networks, we will use electoral stages as well as a new technique, territory acquisition. In territory acquisition, each candidate tries to “capture” its neighbors (i.e., all other nodes) one at a time; it does so by sending a Capture message containing its id as well as the number of nodes captured so far (the stage). If the attempt is successful, the attacked neighbor becomes captured, and the candidate enters the next stage and continues; otherwise, the candidate becomes passive. The candidate that is successful in capturing all entities becomes the leader.

Summarizing, at any time an entity is candidate, captured, or passive. A captured entity remembers the id, the stage, and the link to its “owner” (i.e., the entity that captured it). Let us now describe an electoral stage.

1. A candidate entity x sends a Capture message to a neighbor y.
2. If y is candidate, the outcome of the attack depends on the stage and the id of the two entities:
   (a) If stage(x) > stage(y), the attack is successful.
   (b) If stage(x) = stage(y), the attack is successful if id(x) < id(y); otherwise x becomes passive.
   (c) If stage(x) < stage(y), x becomes passive.
3. If y is passive, the attack is successful.
4. If y is already captured, then x has to defeat y’s owner z before capturing y. Specifically, a Warning message with x’s id and stage is sent by y to its owner z.
   (a) If z is a candidate in a higher stage, or in the same stage but with a smaller id than x, then the attack on y is not successful: z will notify y, which in turn will notify x.
   (b) In all other cases (z is already passive or captured, z is a candidate in a smaller stage, or in the same stage but with a larger id than x), the attack on y is successful: z notifies x via y and, if it is a candidate, becomes passive.
5. If the attack is successful, y is captured by x, x increments stage(x) and proceeds with its conquest.

Notice that each attempt from a candidate costs exactly two messages (one for the Capture, one for the notification) if the neighbor is also a candidate or passive; instead, if the neighbor was already captured, two additional messages will be sent (from the neighbor to its owner, and back). The strategy just outlined will indeed solve the election problem (Exercise 3.10.65).
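The outcome rules of an electoral stage (rules 2 to 4 above) can be summarized by a small decision function. The sketch below is not the protocol itself: it is a centralized illustration in which every entity is described by a hypothetical `(stage, id)` pair, and the names `wins`, `attack_succeeds`, and `messages_used` are introduced here only for the example.

```python
CANDIDATE, PASSIVE, CAPTURED = "candidate", "passive", "captured"


def wins(x, z):
    """x and z are (stage, id) pairs of two candidates.
    x defeats z with a higher stage, or with the same stage and a smaller id."""
    return x[0] > z[0] or (x[0] == z[0] and x[1] < z[1])


def attack_succeeds(x, y_state, y=None, owner_state=None, owner=None):
    """Outcome of a Capture sent by candidate x to a neighbor y.

    y            -- (stage, id) of y, needed only when y is a candidate
    owner_state  -- state of y's owner z, needed only when y is captured
    owner        -- (stage, id) of z, needed only when z is a candidate
    """
    if y_state == PASSIVE:            # rule 3
        return True
    if y_state == CANDIDATE:          # rule 2
        return wins(x, y)
    # rule 4: y is captured, so the owner z decides
    if owner_state == CANDIDATE:
        return wins(x, owner)         # 4(a) fails, 4(b) succeeds
    return True                       # z passive or captured: 4(b)


def messages_used(y_state):
    """Messages spent on one attempt: Capture plus reply,
    and a Warning plus its answer when y is already captured."""
    return 4 if y_state == CAPTURED else 2


if __name__ == "__main__":
    print(attack_succeeds((2, 5), CANDIDATE, y=(1, 3)))   # True
    print(attack_succeeds((1, 5), CANDIDATE, y=(1, 3)))   # False
    print(attack_succeeds((2, 1), CAPTURED,
                          owner_state=CANDIDATE, owner=(3, 9)))  # False
```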
Even though each attempt costs only four (or fewer) messages, the overall cost can be prohibitive; this is because the number $n_i$ of candidates at level i can in general be very large (Exercise 3.10.66). To control the number $n_i$, we need to ensure that a node is captured by at most one candidate in the same level. In other words, the territories of the candidates in stage i must be mutually disjoint. Fortunately, this can be easily achieved. First of all, we provide some intelligence and decisional power to the captured nodes:

(I) If a captured node y receives a Capture message from a candidate x that is in a stage smaller than the one known to y, then y will immediately notify x that the attack is unsuccessful.

As a consequence, a captured node y will only issue a Warning for an attack at the highest level known to y. A more important change is the following:

(II) If a captured node y sends a Warning to its owner z about an attack from x, y will wait for the answer from z (i.e., locally enqueue any subsequent Capture message in the same or a higher stage) before issuing another Warning.

As a consequence, if the attack from x was successful (and the stage increased), y will send to the new owner x any subsequent Warning generated by processing the enqueued Capture messages. After this change, the territories of any two candidates in the same level are guaranteed to have no nodes in common (Exercise 3.10.64). Protocol CompleteElect implementing the strategy we have just designed is shown in Figures 3.43, 3.44, and 3.45.

Let us analyze the cost of the protocol. How many candidates can there be in stage i? As each of them has a territory of size i and these territories are disjoint, there cannot be more than $n_i \le n/i$ such candidates. Each will originate an attack that will cost at most four messages; thus, in stage i, there will be at most 4n/i messages. Let us now determine the number of stages needed for termination. Consider the following fact: if a candidate has conquered a territory of size n/2 + 1, no other candidate can become leader. Hence, a candidate can become leader as soon as it reaches that stage (it will then broadcast a termination message to all nodes). Thus the total number of messages, including the n − 1 for termination notification, will be

$$n + 1 + \sum_{i=1}^{n/2} 4 n_i \;\le\; n + 1 + 4n \sum_{i=1}^{n/2} \frac{1}{i} \;=\; 4n H_{n/2} + n + 1,$$

which gives the overall cost

$$M[\text{CompleteElect}] \le 2.76\, n \log n - 1.76\, n + 1. \tag{3.37}$$
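The passage from the harmonic sum to the closed form is not spelled out in the text. A plausible reconstruction, assuming the approximation $H_m \approx \ln m$ (lower-order terms dropped; this is an assumption of the sketch, not a statement from the source) and logarithms in base 2, is:

```latex
\begin{align*}
4n\,H_{n/2} + n + 1
  &\approx 4n \ln\frac{n}{2} + n + 1 \\
  &= (4\ln 2)\, n\,(\log n - 1) + n + 1 \\
  &\approx 2.77\, n \log n - 1.77\, n + 1 ,
\end{align*}
```

which agrees with the constants of (3.37) up to rounding, since $4\ln 2 \approx 2.77$.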

Let us now consider the time cost of the protocol. It is not difficult to see that in the worst case, the ideal time of protocol CompleteElect is linear (Exercise 3.10.67):

$$T[\text{CompleteElect}] = O(n). \tag{3.38}$$

This must be contrasted with the O(1) time cost of the simple strategy of each entity sending its id immediately to all its neighbors, thus receiving the id of everybody else, and determining the smallest id. Obviously, the price we would pay for an O(1) time cost is $O(n^2)$ messages. Appropriately combining the two strategies, we can actually construct protocols that offer optimal O(n log n) message costs with O(n/ log n) time (Exercise 3.10.68). The time can be further reduced at the expense of more messages. In fact, it is possible to design an election protocol that, for any log n ≤ k ≤ n, uses O(nk) messages and O(n/k) time in the worst case (Exercise 3.10.69).


PROTOCOL CompleteElect.

S = {ASLEEP, CANDIDATE, PASSIVE, CAPTURED, FOLLOWER, LEADER}; S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ CompleteGraph.

ASLEEP
    Spontaneously
    begin
        stage := 1; value := id(x);
        Others := N(x);
        next ← Others;
        send("Capture", stage, value) to next;
        become CANDIDATE;
    end

    Receiving("Capture", stage*, value*)
    begin
        send("Accept", stage*, value*) to sender;
        stage := 1;
        owner := sender;
        ownerstage := stage* + 1;
        become CAPTURED;
    end

CANDIDATE
    Receiving("Capture", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("Reject", stage) to sender;
        else
            send("Accept", stage*, value*) to sender;
            owner := sender;
            ownerstage := stage* + 1;
            become CAPTURED;
        endif
    end

    Receiving("Accept", stage, value)
    begin
        stage := stage + 1;
        if stage ≥ 1 + n/2 then
            send("Terminate") to N(x);
            become LEADER;
        else
            next ← Others;
            send("Capture", stage, value) to next;
        endif
    end

(CONTINUES ...)

FIGURE 3.43: Protocol CompleteElect (I).

3.6.2 Surprising Limitation

We have just developed an efficient protocol for election in complete networks. Its cost is O(n log n) messages. Observe that this is the same as we were able to do in ring networks (actually, the multiplicative constant here is worse).


CANDIDATE
    Receiving("Reject", stage*)
    begin
        become PASSIVE;
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("No", stage) to sender;
        else
            send("Yes", stage*) to sender;
            become PASSIVE;
        endif
    end

PASSIVE
    Receiving("Capture", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("Reject", stage) to sender;
        else
            send("Accept", stage*, value*) to sender;
            ownerstage := stage* + 1;
            owner := sender;
            become CAPTURED;
        endif
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("No", stage) to sender;
        else
            send("Yes", stage*) to sender;
        endif
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

(CONTINUES ...)

FIGURE 3.44: Protocol CompleteElect (II).

Unlike rings, in complete networks, each entity has a direct link to all other entities and there is a total of $O(n^2)$ links. By exploiting all this communication hardware, we should be able to do better than in rings, where there are only n links, and where entities can be O(n) far apart.


CAPTURED
    Receiving("Capture", stage*, value*)
    begin
        if stage* < ownerstage then
            send("Reject", ownerstage) to sender;
        else
            attack := sender;
            send("Warning", stage*, value*) to owner;
            close N(x) − {owner};
        endif
    end

    Receiving("No", stage*)
    begin
        open N(x);
        send("Reject", stage*) to attack;
    end

    Receiving("Yes", stage*)
    begin
        ownerstage := stage* + 1;
        owner := attack;
        open N(x);
        send("Accept", stage*, value*) to attack;
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < ownerstage) then
            send("No", ownerstage) to sender;
        else
            send("Yes", stage*) to sender;
        endif
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

FIGURE 3.45: Protocol CompleteElect (III).

The most surprising result about complete networks is that in spite of having available the largest possible amount of connection links and a direct connection between any two entities, for election they do not fare better than ring networks. In fact, any election protocol will require in the worst case Ω(n log n) messages, that is,

Property 3.6.1 M(Elect/IR; K) = Ω(n log n)

To see why this is true, observe that any election protocol also solves the wake-up problem: To become defeated or leader, an entity must have been active (i.e., awake). This simple observation has dramatic consequences. In fact, any wake-up protocol requires at least 0.5 n log n messages in the worst case (Property 2.2.5); thus, any Election protocol requires in the worst case at least the same number of messages.


This implies that, as far as election is concerned, the very large expenses due to the physical construction of $m = (n^2 - n)/2$ links are not justifiable, as the same performance and operational costs can be achieved with only m = n links arranged in a ring.

3.6.3 Harvesting the Communication Power

The lower bound we have just seen carries a very strong and rather surprising message for network development: insofar as election is concerned, complete networks are not worth the large communication hardware costs. The fact that Election is a basic problem and that its solutions are routinely used by more complex protocols makes this message even stronger. The message is surprising because the complete graph, as we mentioned, has the most communication links of any network and the shortest possible distance between any two entities.

To overcome the limit imposed by the lower bound and, thus, to harvest the communication power of complete graphs, we need the presence of some additional tools (i.e., properties, restrictions, etc.). The question becomes: which tool is powerful enough? As each property we assume restricts the applicability of the solution, our quest for a powerful tool should be focused on the least restrictive ones. In this section, we will see how to answer this question. In the process, we will discover some intriguing relationships between port numbering and consistency and shed light on some properties of whose existence we already had an inkling in earlier sections.

We will first examine a particular labeling of the ports that will allow us to make full use of the communication power of the complete graph. The first step consists in viewing a complete graph $K_n$ as a ring $R_n$, where any two nonneighboring nodes have been connected by an additional link, called a chord. Assume that the label associated at x to link (x, y) is equal to the (clockwise) distance from x to y in the ring. Thus, each link in the ring is labeled 1 in the clockwise direction and n − 1 in the other. In general, if $l_x(x, y) = i$, then $l_y(y, x) = n - i$ (see Figure 3.46); this labeling is called chordal. Let us see how election can be performed in a complete graph with such a labeling.
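The chordal labeling just defined can be illustrated with a few lines of code. The sketch assumes, purely for illustration, that the nodes are numbered 0, 1, ..., n − 1 in clockwise order along the ring; the entities themselves have no such global numbering, and the names `chordal_label` and `chordal_ports` are hypothetical.

```python
def chordal_label(x, y, n):
    """Label at x of the link (x, y): the clockwise ring distance from x to y."""
    if x == y:
        raise ValueError("no link from a node to itself")
    return (y - x) % n


def chordal_ports(x, n):
    """Map every port label of x in K_n to the node reached through it."""
    return {chordal_label(x, y, n): y for y in range(n) if y != x}


if __name__ == "__main__":
    n = 5
    # The two labels of the same link always add up to n.
    for x in range(n):
        for y in range(n):
            if x != y:
                assert chordal_label(x, y, n) + chordal_label(y, x, n) == n
    # Ports 1 and n - 1 lead to the two ring neighbors.
    print(chordal_ports(0, n))   # {1: 1, 2: 2, 3: 3, 4: 4}
```

With n = 5, the links labeled 1 and 4 are exactly the ring links, as in Figure 3.46.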
First of all, observe the following: As the links labeled 1 and n − 1 form a ring, the entities could ignore all the other links and execute on this subnet an election protocol for rings, for example, Stages. This approach will yield a solution requiring 2n log n messages in the worst case, thus already improving on CompleteElect. But we can do better than that. Consider a candidate entity x executing stage i: It will send an Election message in each of the two directions; each message will travel along the ring until it reaches another candidate, say y and z (see Figure 3.47). This operation will require the transmission of d(x, y) + d(x, z) messages. Similarly, x will receive the Election messages from both y and z, and decide whether it survives this stage or not, on the basis of the received ids.

FIGURE 3.46: A complete graph with chordal labeling. The links labeled 1 and 4 form a ring.

Now, in a complete graph, there exists a direct link between x and y, as well as between x and z; thus, a message from one to the other could be conveyed with only one transmission. Unfortunately, x does not know which of its n − 1 links connect it to y or to z; y and z are in a similar situation. In the example of Figure 3.47, x does not know that y is the node at distance 5 along the ring (in the clockwise direction), and thus the port connecting x to it is the one with label 5. If it did, the four defeated nodes in between them could be bypassed. Similarly, x does not know that z is at distance −3 (i.e., at distance 3 in the counterclockwise direction) and thus reachable through port n − 3.

FIGURE 3.47: If x knew d(x, y) and d(x, z), it could reach y and z directly.

However, this information can be acquired. Assume that the Election message also contains a counter, initialized to one, which is increased by one unit by each node forwarding it. Then, a candidate receiving the Election message knows exactly which port label connects it to the originator of that message. In our example, the Election message from y will have a counter equal to 5 and will arrive from link 1 (i.e., counterclockwise), while the message from z will have a counter equal to 3 and will arrive from link n − 1 (i.e., clockwise). From this information, x can determine that y can be reached directly through port 5 and z is reachable through link n − 3. Similarly, y (respectively, z) will know that the direct link to x is the one labeled n − 5 (respectively, 3). This means that in the next stage, these chords can be used instead of the corresponding segments of the ring, thus saving message transmissions.

The net effect will be that in stage i + 1, the candidates will use the (smaller) ring composed only of the chords determined in the previous stage, that is, messages will be sent only on the links connecting the candidates of stage i, thus completely bypassing all entities defeated in stage i − 1 or earlier. Assume in our example that x enters stage i + 1 (and thus both y and z are defeated); it will prepare an Election message for the candidates in both directions, say u and v, and will send it directly to y and to z. As before, x does not know where u and v are (i.e., which of its links connect it to them) but, as before, it can determine it. The only difference is that the counter must be initialized to the weight of the chord: Thus, the counter of the Election message sent by x directly to y is equal to 5, and the one to z is equal to 3. Similarly, when an entity forwards the Election message through a link, it will add to the counter the weight of that link.

Summarizing, in each stage, the candidates will execute the protocol in a smaller ring. Let R(i) be the ring used in stage i; initially R(1) = $R_n$. Using the ring protocol Stages in each stage, the number of messages we will be transmitting will be exactly 2(n(1) + n(2) + ... + n(k)), where n(i) is the size of R(i) and k ≤ log n is the number of stages; an additional n − 1 messages will be used for the leader to notify the termination. Observe that no two of the rings R(2), ..., R(k) have links in common (Exercise 3.10.70).
This means that if we consider the graph G composed of all these rings, then the number of links m(G) of G is exactly m(G) = n(2) + ... + n(k). Thus, to determine the cost of the protocol, we need to find out the value of m(G). This can be determined in many ways. In particular, it follows from a very interesting property of those rings. In fact, each R(i + 1) is “contained” in the interior of R(i): All the links of R(i + 1) are chords of R(i), and these chords do not cross. This means that the graph G formed by all these rings is planar; that is, it can be drawn in the plane without any edge crossing. A well-known fact about planar graphs is that they are sparse, that is, they contain very few links: not more than 3(n − 2) (if you did not know it, now you do). This means that our graph G has m(G) ≤ 3n − 6. As our protocol, which we shall call Kelect-Stages, uses 2(n(1) + m(G)) + n messages in the worst case, and n(1) = n, we have M[Kelect-Stages] < 8n − 12.

A less interesting but more accurate measurement of the message costs follows from observing that the nodes in each ring R(i) are precisely the entities that were candidates in stage i − 1; thus, $n(i) = n_{i-1}$. Recalling that $n_i \le \frac{1}{2} n_{i-1}$, and as $n_1 = n$, we have

$$n(1) + n(2) + \ldots + n(k) \;\le\; n + \sum_{i=1}^{k-1} n_i \;<\; 3n,$$

which will give

$$M[\text{Kelect-Stages}] < 7n. \tag{3.39}$$

Notice that if we were to use Alternate instead of Stages as ring protocol (as we can), we would use fewer messages (Exercise 3.10.72). In any case, the conclusion is that the chordal labeling allows us to ﬁnally harvest the communication power of complete graphs and do better than in ring networks.
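The counter mechanism that lets a candidate discover the chord leading to the originator of an Election message can be sketched as follows. This is an illustration under stated assumptions, not the protocol of the text: the message is assumed to carry its travel direction, the weight of a hop is taken to be the ring distance it covers in that direction, and the names `hop_weight`, `counter_after`, and `port_to_originator` are hypothetical.

```python
CW, CCW = "clockwise", "counterclockwise"


def hop_weight(label, n, direction):
    """Ring distance covered by one hop leaving through the port `label`."""
    return label if direction == CW else n - label


def counter_after(labels, n, direction):
    """Counter of an Election message after traversing the given ports in order.
    The first hop initializes the counter; every forwarder adds its own hop."""
    return sum(hop_weight(label, n, direction) for label in labels)


def port_to_originator(counter, n, direction):
    """Port of the receiver that leads directly to the originator.
    A clockwise message comes from `counter` steps behind (port n - counter);
    a counterclockwise one comes from `counter` steps ahead (port counter)."""
    return n - counter if direction == CW else counter


if __name__ == "__main__":
    n = 12
    # y is 5 steps clockwise from x: its message travels counterclockwise
    # over 5 ring links (port n - 1 at each sender).
    c = counter_after([n - 1] * 5, n, CCW)
    print(c, port_to_originator(c, n, CCW))    # 5 5
    # z is 3 steps counterclockwise from x: 3 clockwise ring hops (port 1).
    c = counter_after([1] * 3, n, CW)
    print(c, port_to_originator(c, n, CW))     # 3 9, i.e., port n - 3
```

In a later stage the same functions apply to chords: a message sent by x through chord 5 and forwarded through a chord of label 2 arrives with counter 7, and its receiver reaches x through port n − 7.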

3.7 ELECTION IN CHORDAL RINGS

We have seen how election requires Ω(n log n) messages in rings and can be done with just O(n) messages in complete networks provided with chordal labeling. Interestingly, oriented rings and complete networks with chordal labeling are part of the same family of networks, known as loop networks or chordal rings.

3.7.1 Chordal Rings

A chordal ring $C_n\langle d_1, d_2, \ldots, d_k\rangle$ of size n and k-chord structure $\langle d_1, d_2, \ldots, d_k\rangle$, with $d_1 = 1$, is a ring $R_n$ of n nodes $\{p_0, p_1, \ldots, p_{n-1}\}$, where each node is also directly connected to the nodes at distance $d_i$ and $n - d_i$ by additional links called chords. The link connecting two nodes is labeled by the distance that separates these two nodes on the ring, that is, following the order of the nodes on the ring: Node $p_i$ is connected to the node $p_{(i+d_j) \bmod n}$ through its link labeled $d_j$ (as shown in Figure 3.48). In particular, if the link between p and q is labeled d at p, this link is labeled n − d at q.

FIGURE 3.48: Chordal ring $C_{11}\langle 1, 3\rangle$.

Note that the oriented ring is the chordal ring $C_n\langle 1\rangle$ where label 1 corresponds to “right,” and n − 1 to “left.” The complete graph with chordal labeling is the chordal ring $C_n\langle 1, 2, 3, \ldots, n/2\rangle$. In fact, rings and complete graphs are two extreme topologies among chordal rings.

Clearly, we can exploit the techniques we designed for complete graphs with chordal labeling to develop an efficient election protocol for the entire class of chordal ring networks. The strategy is simple:

1. Execute an efficient ring election protocol (e.g., Stages or Alternate) on the outer ring. As we did in Kelect, the message sent in a stage will carry a counter, updated using the link labels, that will be used to compute the distance between two successive candidates.
2. Use the chords to bypass defeated nodes in the next stage.

Clearly, the more distances can be “bypassed” by the chords, the more messages we will be able to save. As an example, consider the chordal ring $C_n\langle 1, 2, 3, 4, \ldots, t\rangle$, where every entity is connected to its distance-t neighborhood in the ring. In this case (Exercise 3.10.76), a leader can be elected with a number of messages not more than

$$O\left(n + \frac{n}{t}\log\frac{n}{t}\right).$$

A special case of this class is the complete graph, where t = n/2; in it we can bypass any distance in a single “hop” and, as we know, the cost becomes O(n). Interestingly, we can achieve the same O(n) result with fewer chords. In fact, consider the chordal ring $C_n\langle 1, 2, 4, 8, \ldots, 2^{\lceil \log n/2 \rceil}\rangle$; it is called double cube and k = log n. In a double cube, this strategy allows election with just O(n) messages (Exercise 3.10.78), as if we were in a complete graph and had all the links.

At this point, an interesting and important question is what is the smallest set of links that must be added to the ring to achieve a linear election algorithm. The double cube indicates that k = O(log n) suffices. Surprisingly, this can be significantly further reduced (Problem 3.10.12); furthermore, in that case (Problem 3.10.13), the O(n) cost can be obtained even if the links have arbitrary labels.

3.7.2 Lower Bounds

The class of chordal rings is quite large; it includes rings and complete graphs, and the cost of electing a leader varies greatly depending on the structure. For example, we have already seen that the complexity is Θ(n log n) and Θ(n) in those two extreme chordal rings. We can actually establish precisely the complexity of the election problem for the entire class of chordal rings $C_n^t = C_n\langle 1, 2, 3, 4, \ldots, t\rangle$. In fact, we have (Exercise 3.10.77)

$$M(\text{Elect}/IR;\, C_n^t) = \Omega\left(n + \frac{n}{t}\log\frac{n}{t}\right). \tag{3.40}$$

Notice that this class includes the two extremes. In view of the matching upper bound (Exercise 3.10.76), we have

Property 3.7.1 The message complexity of Elect in $C_n^t$ under IR is $\Theta\left(n + \frac{n}{t}\log\frac{n}{t}\right)$.
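The definition of a chordal ring given in Section 3.7.1 can be made concrete with a short sketch that builds the labeled ports of every node. The function name `chordal_ring_ports` and the representation of nodes by their indices 0, ..., n − 1 are assumptions made only for this illustration.

```python
def chordal_ring_ports(n, chords):
    """Ports of the chordal ring C_n<d_1, ..., d_k>.

    Returns a list `ports` where ports[i] maps each port label of node p_i
    to the index of the neighbor reached through it: label d leads to
    p_{(i + d) mod n} and label n - d leads to p_{(i - d) mod n}.
    """
    if 1 not in chords:
        raise ValueError("the chord structure must contain d_1 = 1")
    if any(d < 1 or d >= n for d in chords):
        raise ValueError("every chord must satisfy 1 <= d < n")
    ports = []
    for i in range(n):
        p = {}
        for d in chords:
            p[d] = (i + d) % n
            p[n - d] = (i - d) % n
        ports.append(p)
    return ports


if __name__ == "__main__":
    ports = chordal_ring_ports(11, [1, 3])   # the chordal ring of Figure 3.48
    print(ports[0])                          # {1: 1, 10: 10, 3: 3, 8: 8}
    # A link labeled d at one end is labeled n - d at the other end.
    for i, p in enumerate(ports):
        for d, j in p.items():
            assert ports[j][11 - d] == i
```

Calling it with `chords=[1]` gives the oriented ring, and with all distances from 1 to n/2 it gives the complete graph with chordal labeling.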

3.8 UNIVERSAL ELECTION PROTOCOLS

We have so far studied in detail the election problem in specific topologies; that is, we have developed solution protocols for restricted classes of networks, exploiting in their design all the graph properties of those networks so as to minimize the costs and increase the efficiency of the protocols. In this process, we have learned some strategies and principles, which are, however, very general (e.g., the notion of electoral stages), as well as the use of known techniques (e.g., broadcasting) as modules of our solution.

We will now focus on the main issue, the design of universal election protocols, that is, protocols that run in every network, requiring neither a priori knowledge of the topology of the network nor that of its properties (not even its size). In terms of communication software, such protocols are obviously totally portable, and thus highly desirable.

We will describe two such protocols, radically different from each other. The first, Mega-Merger, which constructs a rooted spanning tree, is highly efficient (optimal in the worst case); the protocol is, however, rather complex in terms of both specifications and analysis, and its correctness is still without a simple formal proof. The second, Yo-Yo, is a minimum-finding protocol that is exceedingly simple to specify and to prove correct; its real cost is, however, not yet known.

3.8.1 Mega-Merger

In this section, we will discuss the design of an efficient algorithm for leader election, called Mega-Merger. This protocol is topology independent (i.e., universal) and constructs a (minimum cost) rooted spanning tree of the network.

Nodes are small villages each with a distinct name, and edges are roads each with a different distance. The goal is to have all villages merge into one large megacity. A city (even a small village will be considered such) always tries to merge with the closest neighboring city. When merging, there are several important issues that must be resolved.
First and foremost is the naming of the new city. The resolution of this issue depends on how far the involved cities have progressed in the merging process, that is, on the level they have reached and on whether the merger decision is shared by both cities. The second issue to be resolved during a merging is the decision of which roads of the new city will be serviced by public transports. When a merger occurs, the roads of the new city serviced by public transports will be the roads of the two cities already serviced plus only the shortest road connecting them.


Let us clarify some of these concepts and notions, as well as the basic rules of the game.

1. A city is a rooted tree; the nodes are called districts, and the root is also known as downtown.
2. Each city has a level and a unique name; all districts eventually know the name and the level of their city.
3. Edges are roads, each with a distinct distance (from a totally ordered set). The city roads are only those serviced by public transport.
4. Initially, each node is a city with just one district, itself, and no roads. All cities are initially at the same level. Note that as a consequence of rule (1), every district knows the direction (i.e., which of its links in the tree leads) to its downtown (Figure 3.49).
5. A city must merge with its closest neighboring city. To request the merging, a Let-us-Merge message is sent on the shortest road connecting it to that city.
6. The decision to request a merger must originate from downtown and, until the request is resolved, no other request can be issued from that city.

FIGURE 3.49: A city is a tree rooted in its downtown.

7. When a merger occurs, the roads of the new city serviced by public transport will be the roads of the two cities already serviced plus the shortest road connecting them.

Thus, to merge, the downtown of city A will first determine the shortest link, which we shall call the merge link, connecting it to a neighboring city; once this is done, a Let-us-Merge is sent through that link; the message will contain information identifying the city, its level, and the chosen merge link. Once the message reaches the other city, the actual merger can start to take place. Let us examine the components of this entire process in some detail.

We will consider city A, denote by D(A) its downtown, by level(A) its current level, and by e(A) = (a, b) the merge link connecting A to its closest neighboring city; let B be such a city. Node b will be called the entry point of the request from A to B, and node a the exit point. Once the Let-us-Merge message from a in A reaches the district b of B, three cases are possible.

If the two cities have the same level and each asks to merge with the other, we have what is called a friendly merger: The two cities merge into a new one; to avoid any conflict, the new city will have a new name and a new downtown, and its level is increased:

8. If level(A) = level(B) and the merge link chosen by A is the same as that chosen by B (i.e., e(A) = e(B)), then A and B perform a friendly merger.

If a city asks for a merger with a city of higher level, it will just be absorbed, that is, it will acquire the name and the level of the other city:

9. If level(A) < level(B), A is absorbed in B.

In all other cases, the request for merging and, thus, the decision on the name are postponed:

10. If level(A) = level(B), but the merge link chosen by A is not the same as that chosen by B (i.e., e(A) ≠ e(B)), then the merge process of A with B is suspended until the level of b’s city becomes larger than that of A.
11. If level(A) > level(B), the merge process of A with B is suspended: b will locally enqueue the message until the level of b’s city is at least as large as the one of A. (As we will see later, this case will never occur.)

Let us see these rules in more detail.

Absorption. The absorption process is the conclusion of a merger request sent by A to a city with a higher level (rule 9). As a result, city A becomes part of city B, acquiring the name, the downtown, and the level of B. This means that during absorption, (i) the logical orientation of the roads in A must be modified so that they are directed toward the new downtown (so rule (1) is satisfied); (ii) all districts of A must be notified of the name and level of the city they just joined (so rule (2) is satisfied). All these requirements can be easily and efficiently achieved.

First of all, the entry point b will notify a (the exit point of A) that the outcome of the request is absorption, and it will include in the message all the relevant information about B (name and level). Once a receives this information, it will broadcast it in A; as a result, all districts of A will join the new city and know its name and its level. To transform A so that it is rooted in the new downtown is fortunately simple. In fact, it is sufficient to logically direct toward B the link connecting a to b and to “flip” the logical direction only of the edges in the path from the exit point a to the old downtown of A (Exercise 3.10.79), as shown in Figure 3.50. This can be done as follows: Each of the districts of A on the path from a to D(A), when it receives the broadcast from a, will locally direct toward B two links: the one from which the broadcast message is received and the one toward its old downtown.
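The re-rooting performed during absorption can be illustrated with a centralized sketch. It assumes, for illustration only, that each city is stored as a dictionary `parent` mapping every district to the neighbor on the way to its downtown (`None` for the downtown itself); in the protocol this reversal is of course carried out locally by the districts as the broadcast from a passes by. The names `absorb` and `downtown_of` are hypothetical.

```python
def absorb(parent, a, b):
    """Re-root city A into city B through the merge link (a, b).

    Only the links on the path from the exit point a to the old downtown
    D(A) are flipped; the link (a, b) is directed toward b.
    """
    prev, node = b, a
    while node is not None:
        nxt = parent[node]      # old direction: toward D(A)
        parent[node] = prev     # new direction: toward b, hence toward D(B)
        prev, node = node, nxt


def downtown_of(parent, node):
    """Follow the logical directions until the downtown is reached."""
    while parent[node] is not None:
        node = parent[node]
    return node


if __name__ == "__main__":
    # City A: downtown 1 with districts 2, 3, 4; exit point a = 3.
    # City B: downtown 11 with district 10; entry point b = 10.
    parent = {1: None, 2: 1, 3: 2, 4: 1, 10: 11, 11: None}
    absorb(parent, a=3, b=10)
    print(parent)   # {1: 2, 2: 3, 3: 10, 4: 1, 10: 11, 11: None}
    print([downtown_of(parent, v) for v in (1, 2, 3, 4)])   # [11, 11, 11, 11]
```

District 4, which is not on the path from a to D(A), keeps its old direction and still reaches the new downtown.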

FIGURE 3.50: Absorption. To make the districts of A be rooted in D(B), the logical direction of the links (in bold) from the downtown to the exit point of A has been “flipped.”
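Rules (8) to (11), which determine what happens when a Let-us-Merge message from A reaches the entry point b of B, can be summarized by a small decision function. The sketch assumes that a merge link is identified by a value that compares equal at both of its ends (for instance its distance, since all roads have distinct distances); this representation and the name `merge_outcome` are assumptions of the example.

```python
FRIENDLY, ABSORPTION, SUSPENDED = "friendly merger", "absorption", "suspended"


def merge_outcome(level_a, link_a, level_b, link_b):
    """Outcome of a Let-us-Merge request from city A arriving at city B.

    level_a, level_b -- current levels of the two cities
    link_a, link_b   -- merge links e(A) and e(B) chosen by the two cities
    """
    if level_a < level_b:
        return ABSORPTION                     # rule 9
    if level_a == level_b and link_a == link_b:
        return FRIENDLY                       # rule 8
    return SUSPENDED                          # rules 10 and 11


if __name__ == "__main__":
    print(merge_outcome(1, 7, 1, 7))   # friendly merger
    print(merge_outcome(1, 7, 2, 3))   # absorption
    print(merge_outcome(1, 7, 1, 4))   # suspended
```

A suspended request is simply kept at b and re-evaluated with the same function whenever the level of b's city changes.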

Friendly Merger. If A and B are at the same level in the merging process (i.e., level(A) = level(B)) and want to merge with each other (i.e., e(A) = e(B)), we have a friendly merger. Notice that if this is the case, a must also receive a Let-us-Merge message from b. The two cities now become one with a new downtown, a new name, and an increased level:

(i) The new downtown will be the one of a and b that has the smaller id (recall that we are working under the ID restriction).
(ii) The name of the new city will be the name of the new downtown.
(iii) The level will be increased by one unit.

Both a and b will independently compute the new name, level, and downtown. Then each will broadcast this information to its old city; as a result, all districts of A and B will join the new city and know its name and its level. Both A and B must be transformed so that they are rooted in the new downtown. As discussed in the case of absorption, it is sufficient to “flip” the logical direction only of the edges in the path from a to the old downtown of A, and of those in the path from b to the old downtown of B (Figure 3.51).

Suspension. In two cases (rules (10) and (11)), the merge request of A must be suspended: b will then locally enqueue the message until the level of its city is such that it can apply rule (8) or (9). Notice that in case of suspension, nobody from city A knows that their request has been suspended; because of rule (6), no other request can be launched from A.

Choosing the Merging Edge. According to rule (6), the choice of the merging edge e(A) in A is made by the downtown D(A); according to rule (5), e(A) must be the shortest road connecting A to a neighboring city. Thus, D(A) needs to find the minimum length among all the edges incident on the nodes of the rooted tree A; this will be done by implementing rule (5) as follows:

(5.1) Each district $a_i$ of A determines the length $d_i$ of the shortest road connecting it to another city (if none goes to another city, then $d_i = \infty$).
(5.2) D(A) computes the smallest of all the $d_i$.
Concentrate on part (5.1) and consider a district $a_i$; it must find among its incident edges the shortest one that leads to another city.

IMPORTANT. Obviously, $a_i$ does not need to consider the internal roads (i.e., those that connect it to other districts of A). Unfortunately, if a link is unused, that is, no message has been sent or received through it, it is impossible for $a_i$ to know if this road is internal or leads to a neighboring city (Figure 3.52). In other words, $a_i$ must also try the internal unused roads.


ELECTION


FIGURE 3.51: Friendly merger. (a) The two cities have the same level and choose the same merge link. (b) The new downtown is the exit node (a or b) with smallest id.

Thus, $a_i$ will determine the shortest unused edge e, prepare an Outside? message, send it on e, and wait for a reply. Consider now the district c on the other side of e, which receives this message; c knows the name(C) and the level(C) of its city (which could, however, be changing).
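The reply that c gives depends only on the name and level carried by the Outside? message and on the name and level c currently holds. A minimal sketch of that three-way decision follows; the function name and the string labels (in particular "Postpone", which stands for "hold the reply until the level of c's city has caught up") are illustrative assumptions.

```python
def reply_to_outside(name_a, level_a, name_c, level_c):
    """Decide how district c answers an Outside? message from city A."""
    if name_a == name_c:
        return "Internal"    # same city: the road is internal
    if level_c >= level_a:
        return "External"    # C cannot be in the middle of being absorbed by A
    return "Postpone"        # C might be being absorbed by A: no sure answer yet


# Hypothetical names and levels for illustration.
answers = [
    reply_to_outside("A", 2, "A", 2),
    reply_to_outside("A", 2, "C", 3),
    reply_to_outside("A", 2, "C", 1),
]
```

A postponed message would be re-examined with the same function each time the name or level known to c changes.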


FIGURE 3.52: Some unused links might lead back to the city.

If name(A) = name(C) (recall that the message contains the name of A), c will reply Internal to $a_i$, the road e will be marked as internal (and no longer used in the protocol) by both districts, and $a_i$ will restart its process to find the shortest local unused edge.

If name(A) ≠ name(C), it does not necessarily mean that the road is not internal. In fact, it is possible that while c is processing this message, its city C is being absorbed by A. Observe that in this case, level(C) must be smaller than level(A) (because by rule (8) only a city with smaller level will be absorbed). This means that if name(A) ≠ name(C) but level(C) ≥ level(A), then C is not being absorbed by A, and C is for sure a different city; thus, c will reply External to $a_i$, which will have thus determined what it was looking for: $d_i$ = length(e).

The only case left is when name(A) ≠ name(C) and level(C) < level(A), the case in which c cannot give a sure answer. So, it will not: c will postpone the reply until the level of its city becomes greater than or equal to that of A. Note that this means that the computation in A is suspended until c is ready.

NOTE. As a consequence of this last case, rule (11) will never be applied (Exercise 3.10.80).

In conclusion, determining whether a link is internal should be simple but, because of concurrency, the process is neither trivial nor obvious.

Concentrate on part (5.2). This is easy to accomplish; it is just a minimum finding in a rooted tree, for which we can use the techniques discussed in Section 2.6.7. Specifically, the entire process is composed of a broadcast of a message informing all districts in the city of the current name and level (i) of the city, followed by a convergecast.

Issues and Details We have just seen in detail the process of determining the merge link as well as the rules governing a merger. Because of the asynchronous


nature of the system and its unpredictable (though finite) communication delays, it will probably be the case that different cities and districts will be at different levels at the same time. In fact, our rules take explicitly into account the interaction between neighboring cities at different levels. There are a few situations where the application of the rules will not be evident and thus require a more detailed treatment.

(I) Discovering a friendly merger We have seen that when the Let-us-Merge message from A to B arrives at b, if level(A) = level(B), the outcome will be different (friendly merger or postponement) depending on whether e(A) = e(B) or not. Thus, to decide if it is a friendly merger, b needs to know both e(A) and e(B). When the Let-us-Merge message sent from a arrives at b, b knows that e(A) = (a, b).

Question. How does b know e(B)?

The answer is interesting. As we have seen, the choice of e(B) is made by the downtown D(B), which will forward the merger request message of B towards the exit point. If e(A) = e(B), b is the exit point and, thus, it will eventually receive the message to be sent to a; then (and only then) b will know the answer to the question, and that it is dealing with a friendly merger. If e(A) ≠ e(B), b is not the exit point. Note that, unless b is on the way from downtown D(B) to the exit point, b will not even know what e(B) is. Thus, what really happens when the Let-us-Merge message from A arrives at b is the following. If b has already received a Let-us-Merge message from its downtown to be sent to a, then b knows that it is a friendly merger; a will also know when it receives the request from b. (Note for hackers: thus, in this case, no reply to the request is really necessary.)
Otherwise b does not know; thus it waits: if it is a friendly merger, sooner or later the message from its downtown will arrive and b will know; if B is requesting another city, eventually the level of b’s city will increase becoming greater than level(A) (which, as A is still waiting for the reply, cannot increase), and thus result in A being absorbed. (II) Overlapping discovery of an internal link In the merge-link calculation, when the Outside? message from a in A is sent to neighbor b in B, if name(A) = name(B) then the link (a, b) is internal and should be removed from consideration by both a and b. As b knows (it just found out receiving the message) but a possibly does not, b will send to a the reply Internal. However, if b also had sent to a an Outside? message, when a receives that message, it will ﬁnd out that (a, b) is internal, and the Internal reply would be redundant. In other words, if a and b from the same city independently send to each other an Outside? message, there is no need for either of them to reply Internal to the other. (III) Interaction between absorption and link calculation A situation that requires attention is due to the interaction between merge-link calculation and absorption. Consider the Let-us-Merge message sent by a on merge


link e(A) = (a, b) to b, and let level(A) = j < i = level(B); thus, A will have to be absorbed in B. Suppose that, when b receives the message, it is computing the merge link for its city B; as its level is i, we will call it the i-level merge link. What b will do in this case is to first proceed with the absorption of A (so as to involve it in the i-level merge-link computation), and then to continue its own computation of the merge link. More precisely, b will start the broadcast in A of the name and level of B, asking the districts there to participate in the computation of the i-level merge link for B, and then resume its computation. Suppose instead that b has already finished computing the i-level merge link for its city B; in this case, b will broadcast in A the name and level of B (so as to absorb A), but without requesting the districts of A to participate in the computation of the i-level merge link for B (it is too late).

(IV) Overlap between notification and i-level merge-link calculation As mentioned, the i-level merge-link calculation is started by a broadcast informing all districts in the city of the current name and level (i) of the city. Let us call “start-next” the function provided by these messages. Notice that broadcasts are already used following the discovery of a friendly merger or an absorption. Consider the case of a friendly merger. When the two exit points know that it is a friendly merger, the notification they broadcast will inform all districts in the merged city of the new level and the new name, and tell them to start computing the next merge link. In other words, the notification is exactly the “start-next” broadcast. In the case of an absorption, as we just discussed, a “start-next” broadcast is needed only if it is not too late for the new districts to participate in the current calculation of the merge link.
If it is not too late, the notification message contains the request to participate in the next merge-link calculation; thus, it is just the propagation of the current “start-next” broadcast in this new part of the city. In other words, the “notification” broadcasts act as “start-next” broadcasts, if needed.

3.8.2 Analysis of Mega-Merger

A city only carries out one merger request at a time, but it can be asked concurrently by several cities, which in turn can be asked by several others. Some of these requests will be postponed (because the level is not right, or the entry node does not (yet) know what the answer is, etc.). Due to communication delays, some districts will be taking decisions on the basis of information (level and name of their city) that is obsolete. It is not difficult to imagine very intricate and complex scenarios that can easily occur. How do we know that, in spite of concurrency and postponements and communication delays, everything will eventually work out? How can we be assured that some decisions will not be postponed forever, that is, there will not be deadlock? What guarantees that, in the end, the protocol terminates and a single leader will be elected? In other words, how do we know that the protocol is correct?


Because of its complexity and the variety of scenarios that can be created, there is no satisfactory complete proof of the correctness of the Mega-Merger protocol. We will discuss here a partial proof that will be sufﬁcient for our learning purposes. We will then analyze the cost of the Protocol. Finally, we will discuss the assumption of having distinct lengths associated to the links, examine some interesting connected properties, and then remove the assumption. Progress and Deadlock We will ﬁrst discuss the progress of the computation and the absence of deadlock. To do so, let us pinpoint the cases when the activity of a city C is halted by a district d of another city D. This can occur only when computing the merge edge, or when requesting a merger on the merge edge e(C); more precisely, there are three cases: (i) When computing the merge edge, a district c of C sends the Outside? message to d and D has a smaller level than C. (ii) A district c of C sends the Let-us-Merge message on the merge edge e(C) = (c, d); D and C have the same level but it is not a friendly merger. (iii) A district c of C sends the Let-us-Merge message on the merge edge e(C) = (c, d); D and C have the same level and it is a friendly merger, but d does not know yet. In cases (i) and (ii), the activities of C are suspended and will be resolved (if the protocol is correct) only in the “future,” that is, after D changes level. Case (iii) is different in that it will be resolved within the “present” (i.e., in this level); we will call this case a delay rather than a suspension. Observe that if there is no suspension, there is no problem. Property 3.8.1 If a city at level l will not be suspended, its level will eventually increase (unless it is the megacity). To see why this is true, consider the operations performed by a city C at a level l: Compute the merge edge and send a merge request on the merge edge. 
If it is not suspended, its merge request arrives at a city D with either a larger level (in which case, C is absorbed and its level becomes level(D)) or the same level and same merge edge (the case in which the two cities have a friendly merger and their level increases). So, only suspensions can create problems, but not necessarily so. Property 3.8.2 Let city C at level l be suspended by a district d in city D. If the level of the city of D becomes greater than l, C will no longer be suspended and its level will increase. This is because once the level of D becomes greater than the level of C, d can answer the Outside? message in case (i), as well as the Let-us-Merge message in case (ii). Thus, the only real problem is the presence of a city suspended by another whose level will not grow. We are now going to see that this cannot occur.


Consider the smallest level l of any city at time t, and concentrate on the set 𝒞 of cities operating at that level at that time.

Property 3.8.3 No city in 𝒞 will be suspended by a city at higher level.

This is because for a suspension to exist, the level of D cannot be greater than the level of C (see the cases above). Thus, if a city C ∈ 𝒞 is suspended, it is suspended by some other city C′ ∈ 𝒞. If C′ is not suspended at level l, its level will increase; when that happens, C will no longer be suspended. In other words, there would be no problems as long as there are no cycles of suspensions within 𝒞, that is, as long as there is no cycle $C_0, C_1, \ldots, C_{k-1}$ of cities of 𝒞 where $C_i$ is suspended by $C_{i+1}$ (the operations on the indices are modulo k). The crucial property is the following:

Property 3.8.4 There will be no cycles of suspensions within 𝒞.

The proof of this property is based heavily on the fact that each edge has a unique length (as we have assumed) and that the merge edge e(C) chosen by C is the shortest of all the unused links incident on C. Remember this fact and let us proceed with the proof. By contradiction, assume that the property is false. That is, assume there is a cycle $C_0, C_1, \ldots, C_{k-1}$ of cities of 𝒞 where $C_i$ is suspended by $C_{i+1}$ (the operations on the indices are modulo k). First of all observe that, as all these cities are at the same level, the reason they are suspended can only be that each is involved in an “unfriendly” merger, that is, case (ii). Let us examine the situation more closely: Each $C_i$ has chosen a merge edge $e(C_i)$ connecting it to $C_{i+1}$; thus, $C_i$ is suspending $C_{i-1}$ and is suspended by $C_{i+1}$. Clearly, both $e(C_{i-1})$ and $e(C_i)$ are incident on $C_i$. By definition of merging edge (recall what we said at the beginning of the proof), $e(C_i)$ is shorter than $e(C_{i-1})$ (otherwise $C_i$ would have chosen it instead); in other words, the length $d_i$ of the road $e(C_i)$ is smaller than the length $d_{i-1}$ of $e(C_{i-1})$. This means that $d_0 > d_1 > \ldots > d_{k-1}$; but as it is a cycle of suspensions, $C_{k-1}$ is suspended by $C_0$, that is, $d_{k-1} > d_0$. We have reached a contradiction, which implies that our assumption that the property does not hold is actually false; thus, the property is true.

As a consequence of the property, all cities in 𝒞 will eventually increase their level: first, the ones involved in a friendly merger, next those that had chosen them for a merger (and are thus absorbed by them), then those suspended by the latter, and so on. This implies that at no time will there be deadlock and there is always progress: Use the properties to show that the cities with the smallest level will increase their level; when this happens, again the ones with the smallest level will increase it, and so on. That is,

Property 3.8.5 Protocol Mega-Merger is deadlock free and ensures progress.

Termination We have just seen that there will be no deadlock and that progress is guaranteed. This means that the cities will keep on merging and eventually the


megacity will be formed. The problem is how to detect that this has happened. Recall that no node has knowledge of the network, not even of its size (it is not part of the standard set of assumptions for election); how does an entity find out that all the nodes are now part of the same city? Clearly, it is sufficient for just one entity to determine termination (as it can then broadcast it to all the others). Fortunately, termination detection is simple to achieve; as one might have suspected, it is the downtown of the megacity that will determine that the process is terminated.

Consider the downtown D(A) of city A, and the operations it performs: It coordinates the computation of the merge link and then originates a merge request to be sent on that link. Now, the merge link is the shortest road going to another city. If A is already the megacity, there are no other cities; hence all the unused links are internal. This means that when computing the merge link, every district will explore every unused link left and discover that each one of them is internal; it will thus choose ∞ as its length (meaning that it does not have any outgoing links). This means that the minimum-finding process will return ∞ as the smallest length. When this happens, D(A) understands that the mega-merger is completed, and can notify all others. (Notification is not really necessary: Exercise 3.10.81.) As the megacity is a rooted tree with the downtown as its root, D(A) becomes the leader; in other words,

Property 3.8.6 Protocol Mega-Merger correctly elects a leader.

Cost In spite of the complexity of protocol Mega-Merger, the analysis of its cost is not overly difficult. We will first determine how many levels there can be and then calculate the total number of messages transmitted by entities at a given level.

The Number of Levels A district acquires a larger level because its city has been either absorbed or involved in a friendly merger.
Notice that when there is absorption, only the districts in one of the two cities increase their level, and thus the max level in the system will not be increased. The max level can only increase after a friendly merger. How high can the max level be? We can find out by linking the minimum number of districts in a city to the level of the city.

Property 3.8.7 A city of level i has at least $2^i$ districts.

This can be proved easily by induction. It is trivially true at the beginning (i.e., i = 0). Let it be true for 0 ≤ i ≤ k − 1. A level k city can only be created by a friendly merger of two level k − 1 cities; hence, by inductive hypothesis, such a city will have at least $2 \cdot 2^{k-1} = 2^k$ districts; thus the property is true also for i = k. As a consequence,

Property 3.8.8 No city will reach a level greater than log n.
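The step from Property 3.8.7 to Property 3.8.8 can be spelled out in one line. The derivation below assumes that log denotes the base-2 logarithm and writes |A| for the number of districts of a city A of level i; both are notational assumptions made here for illustration.

```latex
2^{i} \;\le\; |A| \;\le\; n
\quad\Longrightarrow\quad
i \;\le\; \log_2 n .
```

The first inequality is Property 3.8.7, and the second holds because a city cannot contain more districts than there are nodes in the network.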


The Number of Messages per Level Consider a level i; some districts will reach this level from level i − 1 or even lower; others might never reach it (e.g., because of absorption, they move from a level lower than i directly to one larger than i). Consider only those districts that do reach level i and let us count how many messages they transmit in this level. In other words, as each message contains the level, we need to determine how many messages are sent in which the level is i.

We do know that every district (except the downtown) of a city of level i receives a broadcast message informing it that its current level is i, and to start computing the i-level merge link (this last part may not be included). Hence at most every district will receive such a message, accounting for a total of n messages. If the received broadcast also requests to compute the i-level merge link, a district must find its shortest outgoing link by using Outside? messages.

IMPORTANT. For the moment, we will not consider the Outside? messages sent to internal roads (i.e., where the reply is Internal); they will be counted separately later.

In this case, the district will send at most one Outside? message that causes a reply External. The district will then participate in the convergecast, sending one message toward the downtown. Hence, all these activities will account for a total of at most 3n messages.

Once the i-level merge links have been determined, the Let-us-Merge messages are originated and sent to and across the merge links. Regardless of the final outcome of the request, the forwarding of the i-level Let-us-Merge message from the downtown D(A) to the new city through the merge edge e(A) = (a, b) will cause at most n(A) transmissions in a city A with n(A) districts (n(A) − 1 internal and one on the merge edge). This means that these activities will cost in total at most

$\sum_{A \in City(i)} n(A) \le n$

messages, where City(i) is the set of the cities reaching level i. This means that, excluding the level i Outside? messages whose reply is Internal, the total number of messages sent in level i is

Property 3.8.9 Cost(i) ≤ 5n

The Number of Useless Messages In the calculation so far we have excluded the Outside? messages whose reply was Internal. These messages are in a sense “useless” as they do not bring about a merger; but they are also unavoidable. Let us measure their number. On any such road there will be two messages, either the Outside? message and the Internal reply, or two Outside? messages. So, we only need to determine the number of such roads. These roads are not part of the city (i.e., not serviced by public transport). As the final city is a tree, the total number of the publicly serviced roads is exactly n − 1. Thus, the total number of the other roads is exactly m − (n − 1). This means that the total number of useless messages will be

Property 3.8.10 Useless = 2(m − n + 1)


The Total Combining Properties 3.8.8, 3.8.9, and 3.8.10, we obtain the total number of messages exchanged by protocol Mega-Merger during all its levels of execution. To these, we need to add the n − 1 messages due to the downtown of the megacity broadcasting termination (even though these could be saved: Exercise 3.10.81), for a total of

M[Mega-Merger] ≤ 2m + 5n log n + n + 1.    (3.41)
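To get a feeling for the size of this bound, its right-hand side can be evaluated for a concrete network. The helper below is a sketch only: the function name is hypothetical, and taking log to be the base-2 logarithm is an assumption.

```python
import math


def mega_merger_bound(n, m):
    """Right-hand side of (3.41): 2m + 5n log n + n + 1,
    with log taken in base 2 (an assumption of this sketch)."""
    return 2 * m + 5 * n * math.log2(n) + n + 1


# Hypothetical network with n = 8 nodes and m = 12 links.
bound = mega_merger_bound(8, 12)   # 24 + 120 + 8 + 1 = 153
```

For a fixed n, the 2m term dominates in dense networks and the 5n log n term in sparse ones, which matches the O(m + n log n) order of the bound.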

Road Lengths and Minimum-Cost Spanning Trees In all the previous discussions we have made some nonstandard assumptions about the edges. We have in fact assumed that each link has a value, which we called length, and that those values are unique. The existence of link values is not uncommon. In fact, dealing with networks, usually there is a value associated with a link denoting, for example, the cost of using that link, the transmission delays incurred when sending a message through it, and so forth. In these situations, when constructing a spanning tree (e.g., to use for broadcasting), the prime concern is how to construct the one of minimum cost, that is, where the sum of the values of its links is as small as possible. For example, if the value of the link is the cost of using it, a minimum-cost spanning tree is one where broadcasting would be the cheapest (regardless of who is the originator of the broadcast). Not surprisingly, the problem of constructing a minimum-cost spanning tree is important and heavily investigated. We have seen that protocol Mega-Merger constructs a rooted spanning tree of the network. What we are going to see now is that this tree is actually the unique minimum-cost spanning tree of the network. We are also going to see how the nonstandard assumptions that we have made about the existence of unique lengths can be easily removed.

Minimum-Cost Spanning Trees In general, a network can have several minimum-cost spanning trees. For example, if all links have the same value (or have no value), then every spanning tree is minimal. By contrast,

Property 3.8.11 If the link values are distinct, a network has a unique minimum-cost spanning tree.

Assuming that there are distinct values associated to the links, protocol Mega-Merger constructs a rooted spanning tree of the network. What we are going to see now is that this tree is actually the unique minimum-cost spanning tree of the network.
To see why this is the case, we must observe a basic property of the minimum-cost spanning tree T. A fragment of T is a subtree of T.

Property 3.8.12 Let A be a fragment of T, and let e be the link of minimum value among those connecting A to other fragments; let B be the fragment connected to A by e. Then the tree composed by merging A and B through e is also a fragment of T.
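Property 3.8.12 suggests a simple sequential experiment: start with every node as its own fragment and repeatedly merge a fragment with its neighbor across its minimum-value outgoing link. The sketch below does this centrally, with no messages, levels, or concurrency, so it is not the Mega-Merger protocol itself; the function name and the data representation are hypothetical, and the network is assumed to be connected with distinct link values.

```python
def merge_fragments(nodes, edges):
    """Build a spanning tree by repeated minimum-outgoing-link mergers.
    nodes: iterable of node names
    edges: dict mapping (u, v) -> distinct link value
    Returns the set of links used for the mergers."""
    fragment = {v: v for v in nodes}    # node -> label of its fragment
    tree = set()
    while len(set(fragment.values())) > 1:
        label = min(fragment.values())  # any fragment would do; pick one
        # Links with exactly one endpoint inside the chosen fragment.
        outgoing = [(w, e) for e, w in edges.items()
                    if (fragment[e[0]] == label) != (fragment[e[1]] == label)]
        _, e = min(outgoing)            # values are distinct, so no ties
        tree.add(e)
        other = fragment[e[1]] if fragment[e[0]] == label else fragment[e[0]]
        for v in fragment:              # absorb the neighboring fragment
            if fragment[v] == other:
                fragment[v] = label
    return tree


# Hypothetical 4-node network.
links = {(1, 2): 1, (2, 3): 2, (3, 4): 3, (1, 4): 4, (1, 3): 5}
tree = merge_fragments([1, 2, 3, 4], links)
```

On this example the links selected are the three of smallest value, which form the minimum-cost spanning tree of the network.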


This is exactly what the Mega-Merger protocol does: It constructs the minimum-cost spanning tree T (the megacity) by merging fragments (cities) through the appropriate edges (merge links). Initially, each node is a city and, by definition, a single node is a fragment. In general, each city A is a fragment of T; its merge link is chosen as the shortest (i.e., minimum value) link connecting A to any neighboring city (i.e., fragment); hence, by Property 3.8.12, the result of the merger is also a fragment. Notice that the correctness of the process depends crucially on Property 3.8.11, and thus on the distinctness of the link values.

Creating Unique Lengths We will now remove the assumptions that there are values associated to the links and that these values are unique. If there are no values (the more general setting), then a unique value can be easily given to each link using the fact that the nodes have unique ids: To link e = (a, b) associate the sorted pair d(e) = ⟨Min{id(a), id(b)}, Max{id(a), id(b)}⟩ and use the lexicographic ordering to determine which edge has smaller length. So, for example, the link between nodes with ids 17 and 5 will have length ⟨5, 17⟩, which is smaller than ⟨6, 5⟩ but greater than ⟨4, 32⟩. To do this requires, however, that each node knows the id of all its neighbors. This information can be acquired in a preprocessing phase, in which every node sends its id to its neighbors (and will receive theirs from them); the cost will be two additional messages on each link. Thus, even if there are no values associated to the links, it is possible to use protocol Mega-Merger. The price we have to pay is 2m additional messages.

If there are values but they are not (known to be) unique, they can be made so, again using the fact that the nodes have unique ids. To link e = (a, b) with value v(e) associate the sorted triple d(e) = ⟨v(e), Min{id(a), id(b)}, Max{id(a), id(b)}⟩. Thus, links with the same values will now be associated to different lengths.
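Both constructions map directly onto tuples, which Python compares lexicographically. The helper below is a sketch with a hypothetical name and signature; passing no value yields the pair, passing a value yields the triple.

```python
def link_length(id_a, id_b, value=None):
    """Unique length of the link between nodes with ids id_a and id_b.
    Without a value: the sorted pair of ids.
    With a value: the value followed by the sorted pair of ids."""
    lo, hi = min(id_a, id_b), max(id_a, id_b)
    return (lo, hi) if value is None else (value, lo, hi)


# Tuples are ordered lexicographically, as the construction requires.
pair = link_length(17, 5)                     # (5, 17)
triple = link_length(17, 5, value=7)          # (7, 5, 17)
tie_broken = link_length(4, 32, value=7) < triple
```

Two links with the same value but different endpoints always receive different triples, because the node ids are unique.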
So, for example, the link between nodes with ids 17 and 5 and value 7 will have length ⟨7, 5, 17⟩, which is smaller than ⟨7, 6, 5⟩ but greater than ⟨7, 4, 32⟩. Also, in this case, each node needs to know the id of all its neighbors. The same preprocessing phase will achieve the goal with only 2m additional messages.

Summary Protocol Mega-Merger is a universal protocol that constructs a (minimum-cost) spanning tree and returns it rooted in a node, thus electing a leader. If there are no initial distinct values on the links, a preprocessing phase needs to be added, in which each entity exchanges its unique id with its neighbors; then the actual execution of the protocol can start. The total cost of the protocol (with or without preprocessing phase) is O(m + n log n), which, we will see, is worst case optimal. The main drawback of Mega-Merger is its design complexity, which makes any actual implementation difficult to verify.

3.8.3 YO-YO

We will now examine another universal protocol for leader election. Unlike the previous one, it has simple specifications, and its correctness is simple to establish. This protocol, called YO-YO, is a minimum-finding algorithm and consists of two parts: a preprocessing phase and a sequence of iterations. Let us examine them in detail.


Setup In the preprocessing phase, called Setup, every entity x exchanges its id with its neighbors. As a result, it will receive the id of all its neighbors. Then, x will logically orient each incident link (x, y) in the direction of the entity (x or y) with the largest id. So, if id(x) = 5 and its neighbor y has id(y) = 7, x will orient (x, y) toward y; notice that y will also do the same. In fact, the orientation of each link will be consistent at both end nodes.

Consider now the directed graph $\vec{G}$ so obtained. There is a very simple but important property:

Property 3.8.13 $\vec{G}$ is acyclic.

To see why this is true, consider by contradiction the existence of a directed cycle $x_0, x_1, \ldots, x_{k-1}$; this means that $id(x_0) < id(x_1) < \ldots < id(x_{k-1})$ but, as it is a cycle, $id(x_{k-1}) < id(x_0)$, which is impossible.

This means that $\vec{G}$ is a directed acyclic graph (DAG). In a DAG, there are three types of nodes:

– source is a node where all the links are out-edges; thus, a source in $\vec{G}$ is a node with an id smaller than that of all its neighbors, that is, it is a local minimum;
– sink is a node where all the links are in-edges; thus, a sink in $\vec{G}$ is a node whose id is larger than that of all its neighbors, that is, it is a local maximum;
– internal node is a node that is neither a source nor a sink.

As a result of the setup, each node will know whether it is a source, a sink, or an internal node. We will also use the terminology of “down” referring to the direction toward the sinks, and “up” referring to the direction toward the sources (see Figure 3.53). Once this preprocessing is completed, the second part of the algorithm starts. As YO-YO is a minimum-finding protocol, only the local minima (i.e., the sources) will be the candidates (Figure 3.54).

Iteration The core of the protocol is a sequence of iterations. Each iteration acts as an electoral stage in which some of the candidates are removed from consideration.
Each iteration is composed of two parts, or phases, called YO- and -YO.

YO- This phase is started by the sources. Its purpose is to propagate to each sink the smallest among the values of the sources connected to that sink, in the sense that there is a directed path from the source to that sink (see Figure 3.54(a)).

1. A source sends its value down to all its out-neighbors.
2. An internal node waits until it receives a value from all its in-neighbors. It then computes the minimum of all received values and sends it down to its out-neighbors.
3. A sink waits until it receives a value from all its in-neighbors. It then computes the minimum of all received values and starts the second part of the iteration.


FIGURE 3.53: In the Setup phase, (a) the entities know their neighbors’ ids and (b) orient each incident link toward the larger id, creating a DAG.

-YO This phase is started by the sinks. Its purpose is to eliminate some candidates, transforming some sources into sinks or internal nodes. This is done by having the sinks inform their connected sources of whether or not the id they sent is the smallest seen so far (see Figure 3.54(b)). 4. A sink sends YES to all in-neighbors from which the smallest value has been received. It sends NO to all the others. 5. An internal node waits until it receives a vote from all its out-neighbors. If all votes are YES, it sends YES to all in-neighbors from which the smallest value


FIGURE 3.54: In the Iteration stage, only the candidates are sources. (a) In the YO- phase, the ids are ﬁltered down to the sinks. (b) In the -YO phase, the votes percolate up to the sources.

has been received and NO to all the others. If at least one vote was NO, it sends NO to all its in-neighbors.
6. A source waits until it receives a vote from all its out-neighbors. If all votes are YES, it survives this iteration and starts the next one. If at least one vote was NO, it is no longer a candidate.

Before the next iteration can be started, the directions on the links in the DAG must be modified so that only the sources that are still candidates (i.e., those that received only YES) will still be sources; clearly, the modification must be done


FIGURE 3.55: (a) In the -YO phase, we ﬂip the logical direction of the links on which a NO is sent, (b) creating a new DAG, where only the surviving candidates will be sources.

without creating cycles. In other words, we must transform the DAG into a new one, whose only sources are the undefeated ones in this iteration. This modiﬁcation is fortunately simple to achieve. We need only to “ﬂip” the direction of each link where a NO vote is sent (see Figure 3.55(a)). Thus, we have two meta-rules for the -YO part: 7. When a node x sends NO to an in-neighbor y, it will reverse the (logical) direction of that link (thus, y becomes now an out-neighbor of x). 8. When a node y receives NO from an out-neighbor x, it will reverse the (logical) direction of that link (thus, x becomes now an in-neighbor of y).
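Rules 1 to 8 can be exercised together on a small DAG. The following is a centralized, synchronous simulation of one iteration rather than a message-passing implementation: nodes are processed in topological order instead of waiting for messages. The function name and the edge-set representation are assumptions made for illustration, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def yoyo_iteration(edges):
    """One YO-YO iteration on a DAG given as a set of pairs (u, v),
    meaning u -> v; nodes are identified by their ids.
    Returns (new_edges, surviving_sources)."""
    nodes = {x for e in edges for x in e}
    ins = {x: {u for u, v in edges if v == x} for x in nodes}
    outs = {x: {v for u, v in edges if u == x} for x in nodes}
    order = list(TopologicalSorter(ins).static_order())  # sources first

    # YO- phase (rules 1-3): minima flow down from the sources.
    sent = {}
    for x in order:
        sent[x] = x if not ins[x] else min(sent[u] for u in ins[x])

    # -YO phase (rules 4-6): votes flow up from the sinks.
    vote = {}   # vote[(u, x)] is True for YES, False for NO, sent by x to u
    for x in reversed(order):
        all_yes = all(vote[(x, w)] for w in outs[x])  # trivially True at a sink
        for u in ins[x]:
            vote[(u, x)] = all_yes and sent[u] == sent[x]

    survivors = {x for x in nodes
                 if not ins[x] and all(vote[(x, w)] for w in outs[x])}
    # Rules 7-8: flip every link on which a NO was sent.
    new_edges = {(u, v) if vote[(u, v)] else (v, u) for u, v in edges}
    return new_edges, survivors


# Hypothetical DAG with sources 1 and 2, internal node 4, and sink 5.
new_edges, survivors = yoyo_iteration({(1, 4), (2, 4), (2, 5), (4, 5)})
```

In this example source 2 receives NO on both of its links, so both are flipped and node 2 becomes a sink, while source 1 receives only YES and remains the sole source of the new DAG.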


As a result, any source that receives a NO will cease to be a source; it can actually become a sink. Some sinks may cease to be such and become internal nodes, and some internal nodes might become sinks. However, no sink or internal node will ever become a source (Exercise 3.10.83). A new DAG is thus created, where the sources are only those that received all YES in this iteration (see Figure 3.55(b)). Once a node has completed its part in the -YO phase, it will know whether it is a source, a sink, or an internal node in the new DAG. The next iteration could start now, initiated by the sources of the new DAG.

Property 3.8.14 Applying an iteration to a DAG with more than one source will result in a DAG with fewer sources. The source with the smallest value will still be a source.

In each iteration, some sources (at least one) will cease to be sources; in contrast, the source with the smallest value will eventually be the only one left under consideration. In other words, eventually the DAG will have a single source (the overall minimum, say c), and all other nodes will be either sinks or internal nodes. How can c determine that it is the only source left, and thus that it should become the leader? If we were to perform an iteration now, only c’s value would be sent in the YO- phase, and only YES votes would be sent in the -YO phase. The source c would receive only YES votes; but c has received only YES votes in every iteration it has performed (that is why it survived as a source). How can c tell that this time is different and that the process should end?

Clearly, we need some additional mechanisms during the iterations. We are going to add some meta-rules, called Pruning, which will allow us to reduce the number of messages sent during the iterations, as well as to ensure that termination is detected when only one source is left.
Pruning

The purpose of pruning is to remove from the computation the nodes and links that are “useless,” that is, that do not have any impact on the result of the iteration; in other words, if they were not there, the same result would still be obtained: the same sources would stay sources, and the others would be defeated. Once a link or a node is declared “useless,” it will be considered nonexistent during the next iterations and, thus, not used. Pruning is achieved through two meta-rules.

The first meta-rule is a structural one. To explain it, recall that the function of the sinks is to reduce the number of sources by voting on the received values. Consider now a sink that is a leaf (i.e., it has only one in-neighbor); such a node will receive only one value; thus it can only vote YES. In other words, a sink leaf can only agree with the choice (i.e., the decision) made by its parent (i.e., its only neighbor). Thus, a sink leaf is “useless.”

9. If a sink is a leaf (i.e., it has only one in-neighbor), then it is useless; it then asks its parent to prune it. If a node is asked to prune an out-neighbor, it will do so by declaring useless (i.e., removing from consideration in the next iterations) the connecting link.


FIGURE 3.56: Rules of pruning.

Notice that after pruning a link, a node might become a sink; if it is also a leaf, then it becomes useless.

The other meta-rule is geared toward reducing the communication of redundant information. During the YO- phase, an (internal or sink) node might receive the value of the same source from more than one in-neighbor; this information is clearly redundant as, to do its job (choose the minimum received value), it is enough for the node to receive just one copy of that value. Let x receive the value of source s from in-neighbors x1, . . . , xk, k > 1. This means that in the DAG, there are directed paths from s to (at least) k distinct in-neighbors of x. This also means that if the link between x and one of them, say x1, did not exist, the value from s would still arrive at x from the other neighbors, x2, . . . , xk. In fact, if we had removed the links between x and all those in-neighbors except one, x would still have received the value of s from that neighbor. In other words, the links between x and x1, . . . , xk are redundant: It is sufficient to keep one; all others are useless and can be pruned. Notice that the choice of which link to keep is irrelevant.

10. If in the YO- phase, a node receives the same value from more than one in-neighbor, it will ask all of them except one to prune the link connecting them, and it will declare those links useless. If a node receives such a request, it will declare useless (i.e., remove from consideration in the next iterations) the connecting link.

Notice that after pruning a link because of rule (10), a sink might become a leaf and thus useless (by rule (9)) (see Figure 3.57).
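The local decisions behind the two pruning rules can be sketched as follows. This is only a minimal illustration of what a single node decides from the values it received in the YO- phase; the function names and the dictionary representation are assumptions made for this example, and the actual exchange of pruning requests is not modeled.

```python
def redundant_in_links(received):
    """Rule (10), local view of one node.

    received: dict mapping each in-neighbor to the value it sent in the YO- phase.
    Returns the set of in-neighbors whose link is declared useless: for every
    value received more than once, all links carrying it except one.
    """
    kept = {}      # value -> the in-neighbor whose link is kept for that value
    pruned = set()
    # Sorting only makes the choice deterministic; which link is kept is irrelevant.
    for neighbor in sorted(received):
        value = received[neighbor]
        if value in kept:
            pruned.add(neighbor)
        else:
            kept[value] = neighbor
    return pruned


def is_useless_sink_leaf(in_neighbors, out_neighbors):
    """Rule (9): a sink with exactly one in-neighbor is useless."""
    return len(out_neighbors) == 0 and len(in_neighbors) == 1
```

A sink that receives the same value from its only two in-neighbors prunes one of the two links by rule (10); it is then left with a single in-neighbor, so the second function reports it as a useless leaf, which is the cascading effect noted above.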


FIGURE 3.57: The effects of pruning in the ﬁrst iteration: Some nodes (in black) and links are removed from consideration.

The pruning rules require communication: In rule (9), a sink leaf needs to ask its only neighbor to declare the link between them useless; in rule (10), a node receiving redundant information needs to ask some of its neighbors to prune the connecting link. We will have this communication take place during the -YO phase: The message containing the vote will also include the request, if any, to declare that link useless. In other words, pruning is performed when voting.

Let us now return to our concern of how to detect termination. As we will see, the pruning operations, integrated in the -YO phase, will do the trick. To understand how and why, consider the effect of performing a full iteration (with pruning) on a DAG with only one source.


FIGURE 3.58: The effects of pruning in the second iteration: Other nodes (in black) and links are removed from consideration.

Property 3.8.15 If the DAG has a single source, then, after an iteration, the new DAG is composed of only one node, the source.

In other words, when there is a single source c, all other nodes will be removed, and c will be the only useful node left. This situation will be discovered by c when, because of pruning, it will have no neighbors (Figure 3.59).

Costs

The general formula expressing the costs of protocol YO-YO is easy to establish; however, the exact determination of the costs expressed by the formula is still an open research problem. Let us derive the general formula. In the Setup phase, each node sends its value to all its neighbors; hence, on each link there will be two messages sent, for a total of 2m messages.


FIGURE 3.59: The effects of pruning in the third iteration: Termination is detected as the source has no more neighbors in the DAG.

Consider now an iteration. In the YO- phase, every useful node (except the sinks) sends a message to its out-neighbors; hence, on each link still under consideration, there will be exactly one message sent. Similarly, in the -YO phase, every useful node (except the sources) sends a message to its in-neighbors; hence, on each link there will again be exactly one message sent. Thus, in total, in iteration i there will be exactly 2m_i messages, where m_i is the number of links in the DAG used in iteration i. The notification of termination from the leader can be performed by broadcasting on the constructed spanning tree with only n − 1 messages. Hence, the total cost will be

$$2 \sum_{i=0}^{k(G)} m_i + n - 1,$$

where m_0 = m and k(G) is the total number of iterations on network G.

We need now to establish the number of iterations k(G). Let D(1) = $\vec{G}$ be the original DAG obtained from G as a result of setup. Let G(1) be the undirected graph defined as follows: There is a node for each source in D(1), and there is a link between two nodes if and only if the two corresponding sources have a sink in common. (In a DAG, two sources a and b are said to have a common sink c if c is reachable from both a and b.) Consider now the diameter d(G(1)) of this graph.

Property 3.8.16 The number of iterations is at most ⌈log d(G(1))⌉ + 1.

To see why this is the case, consider any two neighbors a and b in G(1). As, by definition, the corresponding sources in D(1) have a common sink, at least one of these two sources will be defeated (because the sink will vote YES to only one of them). This means that if we take any path in G(1), at least half of the nodes on that path will correspond to sources that will cease to be such at the end of this iteration.


Furthermore, if (the source corresponding to) a survives, it will now have a sink in common with each of the undefeated (sources corresponding to) neighbors of b. This means that if we consider the new DAG D(2), the corresponding graph G(2) is exactly the graph obtained by removing the nodes associated with the defeated sources and linking together the nodes previously at distance two. In other words, d(G(2)) ≤ ⌈d(G(1))/2⌉.

The same relationship holds between the graphs G(i − 1) and G(i) corresponding to the DAG D(i − 1) of iteration i − 1 and to the resulting new DAG D(i), respectively. In other words, d(G(i)) ≤ ⌈d(G(i − 1))/2⌉. Observe that d(G(i)) = 1 corresponds to a situation where all sources except one will be defeated in this iteration, and d(G(i)) = 0 corresponds to the situation where there is only one source left (which does not know it yet). As d(G(i)) ≤ 1 after at most ⌈log d(G(1))⌉ iterations, the property follows.

As the diameter of a graph cannot be greater than the number of its nodes, and as the nodes of G(1) correspond to the sources of $\vec{G}$, denoting by s($\vec{G}$) the number of those sources, we have

$$k(G) \le \lceil \log s(\vec{G}) \rceil \le \lceil \log n \rceil.$$

We can thus establish that without pruning, that is, with m_i = m, we have an O(m log n) total cost:

M[YO-YO (without pruning)] ≤ 2 m log n + l.o.t. (3.42)
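To make the graph G(1) used in this argument concrete, the following Python sketch builds it from a DAG and computes its diameter. It assumes a centralized view of the DAG and a connected source graph; the names `source_graph` and `diameter` are illustrative and do not come from the text.

```python
from collections import defaultdict, deque
from itertools import combinations


def source_graph(nodes, edges):
    """Build the undirected graph with one node per source of the DAG and a
    link between two sources whenever some sink is reachable from both."""
    out_nb = defaultdict(set)
    has_in = set()
    for u, v in edges:
        out_nb[u].add(v)
        has_in.add(v)
    sources = [x for x in nodes if x not in has_in]
    sinks = {x for x in nodes if not out_nb[x]}

    def reachable_sinks(s):
        # Depth-first search along the directed links starting from source s.
        seen, stack = {s}, [s]
        while stack:
            x = stack.pop()
            for y in out_nb[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen & sinks

    reach = {s: reachable_sinks(s) for s in sources}
    adj = {s: set() for s in sources}
    for a, b in combinations(sources, 2):
        if reach[a] & reach[b]:  # a and b have a sink in common
            adj[a].add(b)
            adj[b].add(a)
    return adj


def diameter(adj):
    """Largest breadth-first distance between two nodes (graph assumed connected)."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        best = max(best, max(dist.values()))
    return best
```

For a DAG with sources 1, 2, 3 and sinks 10, 11, where 1 and 2 reach 10 and where 2 and 3 reach 11, the source graph is the path 1, 2, 3 of diameter 2. After one iteration only source 1 survives, in line with the halving argument above.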

The unsolved problem is the determination of the real cost of the algorithm when the effects of pruning are taken into account.

3.8.4 Lower Bounds and Equivalences

We have seen a complex but rather efficient protocol, MegaMerger, for electing a leader in an arbitrary network. In fact, it uses O(m + n log n) messages in the worst case. This means that in a ring network it uses O(n log n) messages and is thus optimal, without even knowing that the network is a ring.

The next question we should ask is how efficient a universal election protocol can be. In other words, what is the complexity of the election problem? The answer is not difficult to derive.

First of all, observe that any election protocol must send a message on every link. To see why this is true, assume by contradiction that indeed there is a correct universal election protocol A that in every network G and in every execution under IR does not send a message on every link of G. Consider a network G and an execution of A in G; let z be the entity that becomes leader and let e = (x, y) ∈ E be a link where no message is transmitted by A (Figure 3.60(a)).


FIGURE 3.60: Every universal election protocol must send messages on every link.

We will now construct a new graph H as follows: We make two copies of G and remove from both of them the edge e; we then connect these two graphs G′ and G″ by adding two new edges e1 = (x′, x″) and e2 = (y′, y″), where x′ and x″ (respectively, y′ and y″) are the copies of x (respectively, y) in G′ and G″, respectively, and where the labels are l_x′(e1) = l_x″(e1) = l_x(e) and l_y′(e2) = l_y″(e2) = l_y(e) (see Figure 3.60(b)).

Run exactly the same execution of A we did in G on the two components G′ and G″ of H. This is possible because no message was sent along (x, y) in the original execution: x′ and x″ will never send messages to each other in the current execution; similarly, y′ and y″ will never send messages to each other. This means that the entities of G′ will never communicate with the entities of G″ during this execution; thus, they will not be aware of their existence and will operate solely within G′; similarly for the entities of G″. This means that when the execution of A in G′ terminates, entity z′ will become leader; but similarly, entity z″ in G″ will become leader as well. In other words, two leaders will be elected, contradicting the correctness of protocol A. Hence,

M(Elect/IR) ≥ m.

This lower bound is powerful enough to provide us with interesting and useful information; for example, it states that Ω(n²) messages are needed in a complete graph if you do not know that it is a complete graph. By contrast, we know that there are networks where election requires far more than m messages; for example, in rings m = n but we need Ω(n log n) messages. As a universal election protocol must run in every network, including rings, we can say that in the worst case,

M(Elect/IR) ≥ Ω(m + n log n). (3.43)
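The construction of H used in this argument is easy to express in code. The following Python sketch builds H from the link set of G and the unused link e; it is a hedged illustration in which the function name `double_without_link`, the representation of links as two-element frozensets, and the tagging of the copies with 1 and 2 are all assumptions. Port labels are not modeled.

```python
def double_without_link(edges, e):
    """Build the link set of H from two copies of G.

    edges: set of frozenset({u, v}) undirected links of G
    e:     pair (x, y), a link of G on which no message was sent
    Nodes of H are pairs (v, 1) and (v, 2), one for each copy of G.
    """
    x, y = e
    if frozenset(e) not in edges:
        raise ValueError("e must be a link of G")
    rest = edges - {frozenset(e)}  # G without the link e

    h = set()
    for copy in (1, 2):
        for link in rest:
            u, v = tuple(link)
            h.add(frozenset({(u, copy), (v, copy)}))
    h.add(frozenset({(x, 1), (x, 2)}))  # e1 joins the two copies of x
    h.add(frozenset({(y, 1), (y, 2)}))  # e2 joins the two copies of y
    return h
```

Every node of H has the same degree as the corresponding node of G, since each copy of x and of y trades its endpoint of e for an endpoint of e1 or e2. This is why, locally, the entities cannot tell H apart from G.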


This means that protocol MegaMerger is worst-case optimal, and we know the complexity of the election problem.

Property 3.8.17 The message complexity of election under IR is Θ(m + n log n).

We are now going to see that constructing a spanning tree (SPT) and electing a leader (Elect) are strictly equivalent: Any solution to one of them can be easily modified so as to solve the other with the same message cost (in order of magnitude). First of all, observe that, similarly to the Election problem, SPT also requires a message to be sent on every link (Exercise 3.10.85):

M(SPT/IR) ≥ m. (3.44)

We are now going to see how we can construct a spanning-tree construction algorithm from any existing election protocol. Let A be an election protocol; consider now the following protocol B:

1. Elect a leader using A.

2. The leader starts the execution of protocol Shout.

Recall that protocol Shout (seen in Section 2.5) will correctly construct a spanning tree if there is a unique initiator. As the leader elected in step (1) is unique, a spanning tree will be constructed in step (2). So, protocol B solves SPT. What is the cost? As Shout uses exactly 2m messages, we have M[B] = M[A] + 2m. In other words, with at most O(m) additional messages, any election protocol can be made to construct a spanning tree; as Ω(m) messages are needed anyway (Equation 3.44), this means that

M(SPT/IR) ≤ M(Elect/IR). (3.45)

Focus now on a spanning-tree construction algorithm C. Using C as the first step, it is easy to construct an election protocol D where (Exercise 3.10.86) M[D] = M[C] + O(n). In other words, the message complexity of Elect is no more than that of SPT plus at most another O(n) messages; as election requires more than O(n) messages anyway (Property 3.8.17), this means that

M(Elect/IR) ≤ M(SPT/IR). (3.46)


Combining Equations 3.45 and 3.46, we have not only that the problems are computationally equivalent,

Elect(IR) ≡ SPT(IR), (3.47)

but also that they have the same complexity:

M(Elect/IR) = M(SPT/IR). (3.48)

Using similar arguments, it is possible to establish the computational and complexity equivalence of election with several other problems (e.g., see Exercise 3.10.87).

3.9 BIBLIOGRAPHICAL NOTES

Election in a ring network is one of the first problems studied in distributed computing from an algorithmic point of view. The first solution protocol, All the Way, is due to Gerard Le Lann [29], who proposed it for unidirectional rings. Also for unidirectional rings, protocol AsFar was developed by Ernie Chang and Rosemary Roberts [12]; it was later analyzed experimentally by Friedemann Mattern [34] and analytically by Christian Lavault [31]. The probabilistic bidirectional version ProbAsFar was proposed and analyzed by Ephraim Korach, Doron Rotem, and Nicola Santoro [28]. Hans Bodlaender and Jan van Leeuwen later showed how to make it deterministic and provided further analysis [8]; the exact asymptotic average value has been derived by Christian Lavault [30].

The idea behind the first Θ(n log n) worst-case protocol, Control, is due to Dan Hirschberg and J.B. Sinclair [22]. Protocol Stages was designed by Randolph Franklin [17]; the more efficient Stages with Feedback was developed by Ephraim Korach, Doron Rotem, and Nicola Santoro [27].

The first Θ(n log n) worst-case protocol for unidirectional rings, UniStages, was designed by Danny Dolev, Maria Klawe, and Michael Rodeh [15]. The more efficient MinMax is due to Gary Peterson [39]. The even more efficient protocol MinMax+ has been designed by Lisa Higham and Theresa Przytycka [21]. Bidirectional versions of MinMax with the same complexity as the original (Problem 3.10.4) have been independently designed by Shlomo Moran, Mordechai Shalom, and Shmuel Zaks [35], and by Jan van Leeuwen and Richard Tan [44].

The lower bound for unidirectional rings is due to Jan Pachl, Doron Rotem, and Ephraim Korach [36]. James Burns developed the first lower bound for bidirectional rings [9]. The lower bounds when n is known (Exercises 3.10.45 and 3.10.47), as well as others, are due to Hans Bodlaender [5–7].
The O(n) election protocol for tori was designed by Gary Peterson [38] and later reﬁned for unoriented tori by Bernard Mans [33].


The quest for an O(n) election protocol for hypercubes with dimensional labelings was completed independently by Steven Robbins and Kay Robbins [40], Paola Flocchini and Bernard Mans [16], and Gerard Tel [43]. Stefan Dobrev [13] has designed a protocol that allows O(n) election in hypercubes with any sense of direction, not just the dimensional labeling (Exercise 3.10.63). The protocol for unoriented hypercubes has been designed by Stefan Dobrev and Peter Ruzicka [14].

The first optimal Θ(n log n) protocol for complete networks was developed by Pierre Humblet [23]; an optimal protocol that requires O(n) messages on the average (Exercise 3.10.74) was developed by Mee Yee Chan and Francis Chin [10]. The lower bound is due to Ephraim Korach, Shlomo Moran, and Shmuel Zaks [26], who also designed another optimal protocol. The optimal protocol CompleteElect, reducing the O(n log n) time complexity to O(n), was designed by Yehuda Afek and Eli Gafni [2]; the same bounds were independently achieved by Gary Peterson [38]. The time complexity was later reduced to O(n/log n) without increasing the message costs (Exercise 3.10.68) by Gurdip Singh [42].

The fact that a chordal labeling allows one to fully exploit the communication power of the complete graph was observed by Michael Loui, Teresa Matsushita, and Douglas West, who developed the first O(n) protocol for such a case [32]. Stefan Dobrev [13] has designed a protocol that allows O(n) election in complete networks with any sense of direction, not just the chordal labeling (Exercise 3.10.75).

Election protocols for chordal rings, including the double cube, were designed and analyzed by Hagit Attiya, Jan van Leeuwen, Nicola Santoro, and Shmuel Zaks [3]. The quest for the smallest chord structure has seen k being reduced from O(log n) first to O(log log n) by T.Z. Kalamboukis and S.L. Mantzaris [24], then to O(log log log n) by Yi Pan [37], and finally to O(1) (Problem 3.10.12) by Andreas Fabri and Gerard Tel [unpublished].
The observation that in such a chordal ring, election can be done in O(n) messages even if the links are arbitrarily labeled (Problem 3.10.13) is due to Bernard Mans [33].

The first O(m + n log n) universal election protocol was designed by Robert Gallager [18]. Some of the ideas developed there were later used in MegaMerger, developed by Robert Gallager, Pierre Humblet, and Philip Spira, which actually constructs a min-cost spanning tree [19]. The O(n log n) time complexity of MegaMerger has been reduced first to O(n log∗ n) by Mee Yee Chan and Francis Chin [11] and then to O(n) (Problem 3.10.14) by Baruch Awerbuch [4] without increasing the message complexity. It has been further reduced to Θ(d) (Problem 3.10.15) by Hosame Abu-Amara and Arkady Kanevsky but at the expense of an O(m log d) message cost [1]; the same reduction has been obtained independently by Juan A. Garay, Shay Kutten, and David Peleg [20]. Protocol YO-YO was designed by Nicola Santoro; the proof that it requires at most O(log n) stages is due to Gerard Tel.

The computational relationship between the traversal and the election problems has been discussed and analyzed by Ephraim Korach, Shay Kutten, and Shlomo Moran [25]. The Ω(m + n log n) lower bound for universal election as well as some of the other computational equivalence relationships were first observed by Nicola Santoro [41].


3.10 EXERCISES, PROBLEMS, AND ANSWERS

3.10.1 Exercises

Exercise 3.10.1 Modify protocol MinF-Tree (presented in Section 2.6.2) so as to implement strategy Elect Minimum Initiator in a tree. Prove its correctness and analyze its costs. Show that, in the worst case, it uses 3n + k − 4 ≤ 4n − 4 messages.

Exercise 3.10.2 Design an efficient single-initiator protocol to find the minimum value in a ring. Prove its correctness and analyze its costs.

Exercise 3.10.3 Show that the time costs of protocol All the Way will be at most 2n − 1. Determine also the minimum cost and the condition that will cause it.

Exercise 3.10.4 Modify protocol All the Way so as to use strategy Elect Minimum Initiator.

Exercise 3.10.5 Modify protocol AsFar so as to use strategy Elect Minimum Initiator. Determine the average number of messages assuming that any subset of k∗ entities is equally likely to be the initiators.

Exercise 3.10.6 Expand the rules of protocol Stages described in Section 3.3.4, so as to enforce message ordering.

Exercise 3.10.7 Show that in protocol Stages, there will be at most one enqueued message per closed port.

Exercise 3.10.8 Prove that in protocol Stages with Feedback, the minimum distance between two candidates in stage i is d(i) ≥ 2^(i−1).

Exercise 3.10.9 Show an initial configuration for n = 8 in which protocol Stages will require the most messages. Describe how to construct the “worst configuration” for any n.

Exercise 3.10.10 Determine the ideal time complexity of protocol Stages.

Exercise 3.10.11 Modify protocol Stages using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.12 Write the rules of protocol Stages* described in Section 3.3.4.

Exercise 3.10.13 Assume that in Stages* candidate x in stage i receives a message M* with stage j > i. Prove that if x survives, then id(x) is smaller not only than id* but also than the ids in the messages “jumped over” by M*.

Exercise 3.10.14 Show that protocol Stages* correctly terminates.


Exercise 3.10.15 Prove that the message and time costs of Stages* are no worse than those of Stages. Produce an example in which the costs of Stages* are actually smaller.

Exercise 3.10.16 Write the rules of protocol Stages with Feedback assuming message ordering.

Exercise 3.10.17 Derive the ideal time complexity of protocol Stages with Feedback.

Exercise 3.10.18 Write the rules of protocol Stages with Feedback enforcing message ordering.

Exercise 3.10.19 Prove that in protocol Stages with Feedback, the number of ring segments where no feedback will be transmitted in stage i is n_{i+1}.

Exercise 3.10.20 Prove that in protocol Stages with Feedback, the minimum distance between two candidates in stage i is d(i) ≥ 3^(i−1).

Exercise 3.10.21 Give a more accurate estimate of the message costs of protocol Stages with Feedback.

Exercise 3.10.22 Show an initial configuration for n = 9 in which protocol Stages with Feedback will require the most stages. Describe how to construct the “worst configuration” for any n.

Exercise 3.10.23 Modify protocol Stages with Feedback using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.24 Implement the alternating step strategy under the same restrictions and with the same cost as protocol Alternate but without closing any port.

Exercise 3.10.25 Determine initial configurations that will force protocol Alternate to use k steps when n = F_k.

Exercise 3.10.26 Show that the worst-case number of steps of protocol Alternate is achievable for every n > 4.

Exercise 3.10.27 Determine the ideal time complexity of protocol Alternate.

Exercise 3.10.28 Modify protocol Alternate using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.29 Show the step-by-step execution of Stages and of UniStages in the ring of Figure 3.3. Indicate, for each step, the values known at the candidates.


Exercise 3.10.30 Determine the ideal time complexity of protocol UniStages.

Exercise 3.10.31 Modify protocol UniStages using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.32 Design an exact simulation of Stages with Feedback for unidirectional rings. Analyze its costs.

Exercise 3.10.33 Show the step-by-step execution of Alternate and of UniAlternate in the ring of Figure 3.3. Indicate, for each step, the values known at the candidates.

Exercise 3.10.34 Without changing its message cost, modify protocol UniAlternate so that it does not require Message Ordering.

Exercise 3.10.35 Prove that the ideal time complexity of protocol UniAlternate is O(n).

Exercise 3.10.36 Modify protocol UniAlternate using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.37 Prove that in protocol MinMax, if a candidate x survives an even stage i, its predecessor l(i, x) becomes defeated.

Exercise 3.10.38 Show that the worst-case number of steps of protocol MinMax is achievable.

Exercise 3.10.39 Modify protocol MinMax so that it does not require Message Ordering. Implement your modification and thoroughly test your implementation.

Exercise 3.10.40 For protocol MinMax, consider the configuration depicted in Figure 3.32. Prove that once envelope (11, 3) reaches the defeated node z, z can determine that 11 will survive this stage.

Exercise 3.10.41 Write the rules of Protocol MinMax+ assuming message ordering.

Exercise 3.10.42 Write the rules of Protocol MinMax+ without assuming message ordering.

Exercise 3.10.43 Prove Property 3.3.1.

Exercise 3.10.44 Prove that in protocol MinMax+, if an envelope with value v reaches an even stage i + 1, it saves at least F_i messages in stage i with respect to MinMax (Hint: Use Property 3.3.1).


Exercise 3.10.45 Prove that even if the entities know n, ave_A(I | n known) ≥ (1/4 − ε) n log n for any election protocol A for unidirectional rings.

Exercise 3.10.46 Prove that in bidirectional rings, ave_A(I) ≥ (1/2) n H_n for any election protocol A.

Exercise 3.10.47 Prove that even if the entities know n, ave_A(I | n known) ≥ (1/2) n log n for any election protocol A for unidirectional rings.

Exercise 3.10.48 Determine the exact complexity of Wake-Up in a mesh of dimensions a × b.

Exercise 3.10.49 Show how to broadcast from a corner of a mesh of dimensions a × b with fewer than 2n messages.

Exercise 3.10.50 In Protocol ElectMesh, in the first stage of the election process, if an interior node receives an election message, it will reply to the sender “I am in the interior,” so that no subsequent election messages are sent to it. Explain why it is possible to achieve the same goal without sending those replies.

Exercise 3.10.51 Consider the following simple modification to Protocol ElectMesh: When sending a wake-up message, a node includes the information of whether it is an internal, a border, or a corner node. Then, during the first stage of the election, a border node uses this information, if possible, to send the election message only along the outer ring (it might not be possible). Show that the protocol so modified uses at most 4(a + b) + 5n + k − 32 messages.

Exercise 3.10.52 Broadcasting in Oriented Mesh. Design a protocol that allows broadcasting in an oriented mesh using n − 1 messages regardless of the location of the initiator.

Exercise 3.10.53 Traversal in Oriented Mesh. Design a protocol that allows traversing an oriented mesh using n − 1 messages regardless of the location of the initiator.

Exercise 3.10.54 Wake-Up in Oriented Mesh. Design a protocol that allows waking up all the entities in an oriented mesh using fewer than 2n messages regardless of the location and the number of the initiators.

Exercise 3.10.55 Show that the effect of rounding up α^i does not affect the order of magnitude of the cost of Protocol MarkBorder derived in Section 3.4.2 (Hint: Show that it amounts to at most eight extra messages per candidate per stage with an insignificant change in the bound on the number of candidates in each stage).


Exercise 3.10.56 Show that the ideal time of protocol MarkBorder can be as bad as O(n).

Exercise 3.10.57 Improving Time in Tori () Modify Protocol MarkBorder so that the time complexity is O(√n) without increasing the message complexity. Ensure that the modified protocol is correct.

Exercise 3.10.58 Election in Rectangular Torus () Modify Protocol MarkBorder so that it elects a leader in a rectangular torus of dimension l × w (l ≤ w), using Θ(n + l log l/w) messages.

Exercise 3.10.59 Determine the cost of electing a leader in an oriented hypercube if in protocol HyperElect the propagation of the Match messages is done by broadcasting in the appropriate subcube instead of “compressing the address.”

Exercise 3.10.60 Prove that in protocol HyperElect the distance d(j − 1, j) between w_{j−1}(z) and w_j(z) is at most j.

Exercise 3.10.61 Prove Lemma 3.5.1, that is, that during the execution of protocol HyperElect, the only duelists in stage i are the entities with the smallest id in one of the hypercubes of dimension i − 1 in H_{k:i−1}.

Exercise 3.10.62 Show that the time complexity of Protocol HyperFlood is O(log³ N).

Exercise 3.10.63 () Prove that it is possible to elect a leader in a hypercube using O(n) messages with any sense of direction (Hint: Use long messages).

Exercise 3.10.64 Prove that in the strategy CompleteElect outlined in Section 3.6.1, the territories of any two candidates in the same stage have no nodes in common.

Exercise 3.10.65 Prove that the strategy CompleteElect outlined in Section 3.6.1 solves the election problem.

Exercise 3.10.66 Determine the cost of the strategy CompleteElect described in Section 3.6.1 in the worst case (Hint: Consider how many candidates there can be at level i).

Exercise 3.10.67 Analyze the ideal time cost of protocol CompleteElect described in Section 3.6.1.

Exercise 3.10.68 Design an election protocol for complete graphs that, like CompleteElect, uses O(n log n) messages but uses only O(n/log n) time in the worst case.


Exercise 3.10.69 Generalize the answer to Exercise 3.10.68. Design an election protocol for complete graphs that, for any log n ≤ k ≤ n, uses O(nk) messages and O(n/k) time in the worst case.

Exercise 3.10.70 Prove that all the rings R(2), . . . , R(k) where messages are sent by protocol Kelect do not have links in common.

Exercise 3.10.71 Write the code for, implement, and test protocol Kelect-Stages.

Exercise 3.10.72 () Consider using the ring protocol Alternate instead of Stages in Kelect. Determine what will be the cost in this case.

Exercise 3.10.73 () Determine the average message costs of protocol Kelect-Stages.

Exercise 3.10.74 () Show how to elect a leader in a complete network with O(n log n) messages in the worst case but only O(n) on the average.

Exercise 3.10.75 () Prove that it is possible to elect a leader in a complete graph using O(n) messages with any sense of direction.

Exercise 3.10.76 Show how to elect a leader in the chordal ring C_n⟨1, 2, 3, 4, . . . , t⟩ with O(n + (n/t) log (n/t)) messages.

Exercise 3.10.77 Prove that in chordal ring C_n^t electing a leader requires at least Ω(n + (n/t) log (n/t)) messages in the worst case (Hint: Reduce the problem to that of electing a leader on a ring of size n/t).

Exercise 3.10.78 Show how to elect a leader in the double cube C_n⟨1, 2, 4, 8, . . . , 2^⌈log n⌉⟩ with O(n) messages.

Exercise 3.10.79 Consider a merger message from city A arriving at neighbouring city B along merge link (a, b) in protocol Mega-Merger. Prove that if we reverse the logical direction of the links on the path from D(A) to the exit point a and direct the merge link toward B, the union of A and B will be rooted in the downtown of A.

Exercise 3.10.80 District b of B has just received a Let-us-Merge message from a along merge link (a, b). From the message, b finds out that level(A) > level(B); thus, it postpones the request. In the meanwhile, the downtown D(B) chooses (a, b) as its merge link. Explain why this situation will never occur.

Exercise 3.10.81 Find a way to avoid notification of termination by the downtown of the megacity in protocol Mega-Merger (Hint: Show that by the time the downtown understands that the mega-merger is completed, all other districts already know that their execution of the protocol is terminated).


Exercise 3.10.82 Time Costs. Show that protocol Mega-Merger uses at most O(n log n) ideal time units.

Exercise 3.10.83 Prove that in the YO-YO protocol, during an iteration, no sink or internal node will become a source.

Exercise 3.10.84 Modify the YO-YO protocol so that upon termination, a spanning tree rooted in the leader has been constructed. Achieve this goal without any additional messages.

Exercise 3.10.85 Prove that to solve SPT under IR, a message must be sent on every link.

Exercise 3.10.86 Show how to transform a spanning-tree construction algorithm C so as to elect a leader with at most O(n) additional messages.

Exercise 3.10.87 Prove that under IR, the problem of finding the smallest of the entities' values is computationally equivalent to electing a leader and has the same message complexity.

3.10.2 Problems

Problem 3.10.1 Josephus Problem. Consider the following set of electoral rules. In stage i, a candidate x sends its id and receives the id from its two neighboring candidates, r(i, x) and l(i, x): x does not survive this stage if and only if its id is larger than both received ids. Analyze the corresponding protocol Josephus, determining in particular the number of stages and the total number of messages both in the worst and in the average case. Analyze and discuss its time complexity.

Problem 3.10.2 Alternating Steps () Design a conflict resolution mechanism for the alternating steps strategy to cope with the lack of orientation in the ring. Analyze the complexity of the resulting protocol.

Problem 3.10.3 Better Stages () Construct a protocol based on electoral stages that guarantees $n_i \le n_{i-1}/b$ with $cn$ messages transmitted in each stage, where $c/\log b < 1.89$.

Problem 3.10.4 Bidirectional MinMax () Design a bidirectional version of MinMax with the same costs.

Problem 3.10.5 Distances in MinMax+ () In computing the cost of protocol MinMax+ we have used $dis(i) = F_{i+2}$. Determine what will be the cost if we use $dis(i) = 2^i$ instead.


Problem 3.10.6 MinMax+ Variations () In protocol MinMax+ we use "promotion by distance" only in the even stages and "promotion by witness" only in the odd stages. Determine what would happen if we use
1. only "promotion by distance" but in every stage;
2. only "promotion by witness" but in every stage;
3. "promotion by distance" in every stage and "promotion by witness" only in odd stages;
4. "promotion by witness" in every stage and "promotion by distance" only in even stages;
5. both "promotion by distance" and "promotion by witness" in every stage.

Problem 3.10.7 Bidirectional Oriented Rings. () Prove or disprove that there is an efficient protocol for bidirectional oriented rings that can be neither used nor simulated, with the same or better costs, in unidirectional rings or in general bidirectional ones.

Problem 3.10.8 Unoriented Hypercubes. () Design a protocol that can elect a leader in a hypercube with arbitrary labelling using O(n log log n) messages. Implement and test your protocol.

Problem 3.10.9 Linear Election in Hypercubes. () Prove or disprove that it is possible to elect a leader in a hypercube in O(n) messages even when it is not oriented.

Problem 3.10.10 Oriented Cube-Connected Cycles () Design an election protocol for an oriented CCC using O(n) messages. Implement and test your protocol.

Problem 3.10.11 Oriented Butterfly. Design an election protocol for an oriented butterfly. Determine its complexity. Implement and test your protocol.

Problem 3.10.12 Minimal Chordal Ring () Find a chordal ring with k = 2 where it is possible to elect a leader with O(n) messages.

Problem 3.10.13 Unlabelled Chordal Rings () Show how to elect a leader in the chordal ring of Problem 3.10.12 with O(n) messages even if the edges are arbitrarily labeled.

Problem 3.10.14 Improved Time () Show how to elect a leader using O(m + n log n) messages but only O(n) ideal time units.

Problem 3.10.15 Optimal Time () Show how to elect a leader in O(d) time using at most O(m log d) messages.


3.10.3 Answers to Exercises

Answer to Exercise 3.10.21 The sizes of the areas where no feedback is sent in stage i can vary from one another, from stage to stage, and from execution to execution. We can still have an estimate of their size. In fact, the distance d(i) between two candidates in stage i is $d(i) \ge 3^{i-1}$ (Exercise 3.10.20). Thus, the total number of message transmissions caused in stage i by the feedback will be at most $n - n_{i+1}\, 3^{i-1}$, yielding a total of at most $\sum_{i=1}^{\log_3 n} \left(3n - n_{i+1}\, 3^{i-1}\right)$ messages.

Answer to Exercise 3.10.44 Let $h_j(a)$ denote the candidate that originated message (a, j). Consider a message (v, i + 1) and its originator $z = h_{i+1}(v)$; this message was sent after receiving (v, i) originated by $x = h_i(v)$. Let $y = h_i(u)$ be the first candidate after x in the ring in stage i, and (u, i) the message it originated. As v survives this stage, which is odd (i.e., min), it must be that v < u. Message (v, i) travels from x toward y; upon receiving (v, i), node z in this interval will generate (v, i + 1). Now z cannot be after node $h_{i-1}(u)$ in the ring because, by rule (IV), $w = h_{i-1}(u)$ would immediately generate (v, i + 1) after receiving (v, i). In other words, either z = w or z is before w. Thus we save at least $d(z, y) \ge d(w, y) = d(h_{i-1}(u), h_i(u)) \ge F_i$, where the last inequality is by Property 3.3.1.

Partial Answer to Exercise 3.10.66 Consider a captured node y that receives an attack, say from a candidate $x_1$ in level i. According to the strategy, y will send a Warning to its owner z to inform it of this attack and wait for a reply; depending on the reply, it will notify $x_1$ of whether the attack was successful (the case in which y will be captured by $x_1$) or not. Assume now that while waiting, y receives one attack after the other, say from candidates $x_2, \ldots, x_k$ in that order, all in the same level i. According to the strategy, y will issue a Warning to its owner z for each of them. Observe now that if $id(z) > id(x_1) > \ldots > id(x_k)$, each of these attacks will be successful, and y will in turn be captured by all those candidates.


CHAPTER 4

Message Routing and Shortest Paths

4.1 INTRODUCTION

Communication is at the base of computing in a distributed environment, but achieving it efficiently is neither simple nor trivial. Consider an entity x that wants to communicate some information to another entity y; for example, x has a message that it wants to be delivered to y. In general, x does not know where y is or how to reach it (i.e., which paths lead to it); actually, it might not even know whether y is a neighbor. Still, communication is always possible if the network G is strongly connected. In fact, it is sufficient for x to broadcast the information: every entity, including y, will receive it. This simple solution, called broadcast routing, is obviously not efficient; on the contrary, it is impractical, expensive, and not very secure (too many other nodes receive the message), even if it is performed only on a spanning tree of the network.

A more efficient approach is to choose a single path in G from x to y: The message sent by x will travel along this path only, relayed by the entities in the path, until it reaches its destination y. The process of determining a path between a source x and a destination y is known as routing. If there is more than one path from x to y, we would obviously like to choose the "best" one, that is, the least expensive one. The cost θ(a, b) ≥ 0 of a link (a, b), traditionally called length, is a value that depends on the system (reflecting, e.g., time delay, transmission cost, link reliability, etc.), and the cost of a path is the sum of the costs of the links composing it. The path of minimum cost is called the shortest path; clearly, the objective is to use this path for sending the message. The process of determining the most economical path between a source and a destination is known as shortest-path routing.

The (shortest-path) routing problem is commonly solved by storing at each entity x the information that allows it to address a message to its destination through a (shortest) path. This information is called the routing table.

In this chapter we will discuss several aspects of the routing problem. First of all, we will consider the construction of the routing tables. We will then address the problem of keeping the information in the tables up to date, should changes occur in the system. Finally, we will discuss how to represent routing information in a compact way, suitable for systems where space is a problem. In the following, and unless otherwise specified, we will assume the set of restrictions IR: Bidirectional Links (BL), Connectivity (CN), Total Reliability (TR), and Initial Distinct Values (ID).

FIGURE 4.1: Determining the shortest paths from s to the other entities.

4.2 SHORTEST PATH ROUTING

The routing table of an entity contains information on how to reach any possible destination. In this section we examine how this information can be acquired and the table constructed. As we will see, this problem is related to the construction of particular spanning trees of the network. In the following, and unless otherwise specified, we will focus on shortest-path routing.

Different types of routing tables can be defined, depending on the amount of information contained in them. We will consider for now the full routing table: For each destination, a shortest path to reach it is stored; if there is more than one shortest path, only the lexicographically smallest will be stored (the lexicographic order is over the strings of the names of the nodes in the paths). For example, in the network of Figure 4.1, the routing table RT(s) for s is shown in Table 4.1. We will see different approaches to constructing routing tables, some depending on the amount of local storage an entity has available.

4.2.1 Gossiping the Network Maps

A first obvious solution would be to construct at every entity the entire map of the network with all the costs; then, each entity can locally and directly compute its shortest-path routing table. This solution obviously requires that the local memory available to an entity be large enough to store the entire map of the network.


TABLE 4.1: Full Routing Table for Node s

Routing Destination    Shortest Path      Cost
h                      (s, h)             1
k                      (s, h)(h, k)       4
c                      (s, c)             10
d                      (s, c)(c, d)       12
e                      (s, e)             5
f                      (s, e)(e, f)       8

The map of the network can be viewed as an n × n array MAP(G), one row and one column per entity, where for any two entities x and y, the entry MAP[x, y] contains information on whether link (x, y) exists and, if so, on its cost. In a sense, each entity x knows initially only its own row MAP[x, ·]. To know the entire map, every entity needs to know the initial information of all the other entities. This is a particular instance of a general problem called input collection or gossip: every entity has a (possibly different) piece of information; the goal is to reach a final configuration where every entity has all the pieces of information.

The solution of the gossiping problem using normal messages is simple: every entity broadcasts its initial information. Since it relies solely on broadcast, this operation is more efficiently performed in a tree. Thus, the protocol will be as follows:

Map Gossip:
1. An arbitrary spanning tree of the network is created, if not already available; this tree will be used for all communication.
2. Each entity acquires full information about its neighborhood (e.g., names of the neighbors, cost of the incident links, etc.), if not already available.
3. Each entity broadcasts its neighborhood information along the tree.

At the end of the execution, each entity has a complete map of the network with all the link costs; it can then locally construct its shortest-path routing table.

The construction of the initial spanning tree can be done using O(m + n log n) messages, for example using protocol Mega-Merger. The acquisition of neighborhood information requires a single exchange of messages between neighbors, requiring in total just 2m messages. Each entity x then broadcasts on the tree deg(x) items of information, and each broadcast of one item costs n − 1 messages on the tree. Hence the total number of messages will be at most

$$\sum_{x} \deg(x)\,(n - 1) = 2m\,(n - 1).$$

Thus, we have

M[Map Gossip] = 2 m n + l.o.t.    (4.1)
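The last step of Map Gossip, the local computation of the routing table from the complete map, can be sketched as follows. This is only an illustration under stated assumptions: the link costs in `COSTS` are those of the network of Figure 4.1 as read off Table 4.3, the function names are hypothetical, the local algorithm is a standard Dijkstra computation (the text does not prescribe one), and paths are represented as sequences of node names rather than sequences of links.

```python
import heapq

# Assumed link costs of the network of Figure 4.1 (taken from Table 4.3).
COSTS = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def build_map(costs):
    """Return the map as rows: net[x][y] is the cost of link (x, y)."""
    net = {}
    for (x, y), c in costs.items():
        net.setdefault(x, {})[y] = c
        net.setdefault(y, {})[x] = c  # links are bidirectional (BL)
    return net

def full_routing_table(net, source):
    """Dijkstra on the complete map; equal costs are resolved by comparing
    the node-name sequences, so the smaller sequence is kept."""
    table = {}
    heap = [(0, (source,))]
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node in table:
            continue  # a path at least as good was already recorded
        table[node] = (path, cost)
        for nxt, c in net[node].items():
            if nxt not in table:
                heapq.heappush(heap, (cost + c, path + (nxt,)))
    del table[source]
    return table

def gossip_broadcast_messages(net):
    """The bound 2m(n - 1) on the messages used by the broadcasts of step 3."""
    n = len(net)
    m = sum(len(row) for row in net.values()) // 2
    return 2 * m * (n - 1)

if __name__ == "__main__":
    MAP = build_map(COSTS)
    for dest, (path, cost) in sorted(full_routing_table(MAP, "s").items()):
        print(dest, path, cost)
    print(gossip_broadcast_messages(MAP))
```

Under these assumptions the costs computed for s agree with Table 4.1, and the broadcast bound evaluates to 2 · 9 · 6 = 108 for this seven-node network.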


This means that, in sparse networks, all the routing tables can be constructed with at most O(n^2) normal messages. Such is the case of meshes, tori, butterflies, and so forth. In systems that allow very long messages, not surprisingly the gossip problem, and thus the routing table construction problem, can be solved with substantially fewer messages (Exercises 4.6.3 and 4.6.4). The time costs of gossiping on a tree depend on many factors, including the diameter of the tree and the number of items an entity initially has (Exercise 4.6.2).

4.2.2 Iterative Construction of Routing Tables

The solution we have just seen requires that each entity has locally available enough storage to store the entire map of the network. If this is not the case, the problem of constructing the routing tables is more difficult to resolve.

Several traditional sequential methods are based on an iterative approach. Initially, each entity x knows only its neighboring information: for each neighbor y, the entity knows the cost θ(x, y) of reaching it using the direct link (x, y). On the basis of this initial information, x can construct an approximation of its routing table. This imperfect table is usually called the distance vector, and in it the cost for those destinations x knows nothing about is set to ∞. For example, the initial distance vector for node s in the network of Figure 4.1 is shown in Table 4.2.

This approximation of the routing table will be refined, and eventually corrected, through a sequence of iterations. In each iteration, every entity communicates its current distance vector to all its neighbors. On the basis of the received information, each entity updates its current information, replacing paths in its own routing table if the neighbors have found better routes.

How can an entity x determine whether a route is better? The answer is very simple: when, in an iteration, x is told by a neighbor y that there exists a path π2 from y to z with cost g2, x checks in its current table the path π1 to z and its cost g1, as well as the cost θ(x, y). If θ(x, y) + g2 < g1, then going directly to y and then using π2 to reach z is less expensive than going to z through the path π1 currently in the table. Among several better choices, obviously x will select the best one.

TABLE 4.2: Initial Approximation of RT(s)

Routing Destination    Shortest Path    Cost
h                      (s, h)           1
k                      ?                ∞
c                      (s, c)           10
d                      ?                ∞
e                      (s, e)           5
f                      ?                ∞
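The test θ(x, y) + g2 < g1 described above can be illustrated on Table 4.2. The sketch below is hypothetical in its names (`rt_s`, `consider`) and in its data layout (a path stored as a list of links); the costs 3 for links (h, k) and (e, k) are assumed from Table 4.3.

```python
INF = float("inf")

# Table 4.2: initial approximation of RT(s), stored as destination -> (path, cost).
rt_s = {
    "h": ([("s", "h")], 1), "k": (None, INF), "c": ([("s", "c")], 10),
    "d": (None, INF), "e": ([("s", "e")], 5), "f": (None, INF),
}

def consider(table, link, link_cost, dest, neighbour_path, neighbour_cost):
    """Apply the test theta(x, y) + g2 < g1 for one destination.

    link is (x, y); neighbour_path and neighbour_cost are the path pi2 and
    the cost g2 announced by the neighbour y. Returns True if the entry
    for dest was replaced.
    """
    _, g1 = table[dest]
    if link_cost + neighbour_cost < g1:
        table[dest] = ([link] + neighbour_path, link_cost + neighbour_cost)
        return True
    return False

# h announces that it reaches k through (h, k) at cost 3: 1 + 3 < infinity.
improved_by_h = consider(rt_s, ("s", "h"), 1, "k", [("h", "k")], 3)
# e announces that it reaches k through (e, k) at cost 3: 5 + 3 is not < 4.
improved_by_e = consider(rt_s, ("s", "e"), 5, "k", [("e", "k")], 3)
```

After the two announcements the entry for k is the path (s, h)(h, k) with cost 4, as in Table 4.1; the second announcement is discarded because it does not improve on the current entry.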

TABLE 4.3: Initial Distance Vectors

      s     h     k     c     d     e     f
s     -     1     ∞     10    ∞     5     ∞
h     1     -     3     ∞     ∞     ∞     ∞
k     ∞     3     -     ∞     ∞     3     5
c     10    ∞     ∞     -     2     ∞     ∞
d     ∞     ∞     ∞     2     -     8     ∞
e     5     ∞     3     ∞     8     -     3
f     ∞     ∞     5     ∞     ∞     3     -

Specifically, let $V_y^i[z]$ denote the cost of the "best" path from y to z known to y in iteration i; this information is contained in the distance vector sent by y to all its neighbors at the beginning of iteration i + 1. After sending its own distance vector and upon receiving the distance vectors of all its neighbors, entity x computes

$$w[z] = \min_{y \in N(x)} \left( \theta(x, y) + V_y^i[z] \right)$$

for each destination z. If $w[z] < V_x^i[z]$, then the new cost and the corresponding path to z are chosen, replacing the current selection.

That interaction with just the neighbors is sufficient follows from the fact that the cost $\gamma_a(b)$ of the shortest path from a to b has the following defining property:

Property 4.2.1

$$\gamma_a(b) = \begin{cases} 0 & \text{if } a = b \\ \min_{w \in N(a)} \{\theta(a, w) + \gamma_w(b)\} & \text{otherwise.} \end{cases}$$

Protocol Iterated Construction, based on this strategy, converges to the correct information and will do so after at most n − 1 iterations (Exercise 4.6.8). For example, in the graph of Figure 4.1, the process converges to the correct routing tables after only two iterations; see Tables 4.3–4.5, where for each entity only the cost information for every destination is displayed.

The main advantage of this process is that the amount of storage required at an entity is proportional to the size of the routing table and not to the map of the entire system.

TABLE 4.4: Distance Vectors After First Iteration

      s     h     k     c     d     e     f
s     -     1     4     10    12    5     8
h     1     -     3     11    ∞     6     8
k     4     3     -     ∞     11    3     5
c     10    11    ∞     -     2     10    ∞
d     12    ∞     11    2     -     8     11
e     5     6     3     10    8     -     3
f     8     8     5     ∞     11    3     -


TABLE 4.5: Distance Vectors After Second Iteration

      s     h     k     c     d     e     f
s     -     1     4     10    12    5     8
h     1     -     3     11    13    6     8
k     4     3     -     13    11    3     5
c     10    11    13    -     2     10    13
d     12    13    11    2     -     8     11
e     5     6     3     10    8     -     3
f     8     8     5     13    11    3     -
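The iterations that lead from Table 4.3 to Table 4.5 can be reproduced with a short centralized simulation. It is a sketch under stated assumptions: iterations are synchronous, only costs are tracked (not paths), the function names are hypothetical, and the loop stops as soon as an iteration changes nothing, a global test used here only for illustration (the entities themselves would simply run the n − 1 iterations).

```python
INF = float("inf")

# Assumed link costs of the network of Figure 4.1 (taken from Table 4.3).
COSTS = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def build_map(costs):
    """net[x][y] is the cost of link (x, y); links are bidirectional."""
    net = {}
    for (x, y), c in costs.items():
        net.setdefault(x, {})[y] = c
        net.setdefault(y, {})[x] = c
    return net

def initial_vectors(net):
    """Initial distance vectors: direct link cost, 0 for itself, infinity otherwise."""
    return {x: {z: 0 if z == x else net[x].get(z, INF) for z in net} for x in net}

def iteration(net, vectors):
    """One synchronous iteration: each x combines the vectors of all its neighbours."""
    updated = {}
    for x in net:
        updated[x] = {}
        for z in net:
            w = min(net[x][y] + vectors[y][z] for y in net[x])
            updated[x][z] = min(vectors[x][z], w)
    return updated

def iterated_construction(net):
    """Iterate until nothing changes; return the vectors and the number of
    iterations that modified at least one entry."""
    vectors, changing = initial_vectors(net), 0
    while True:
        updated = iteration(net, vectors)
        if updated == vectors:
            return vectors, changing
        vectors, changing = updated, changing + 1

if __name__ == "__main__":
    final, changing = iterated_construction(build_map(COSTS))
    print(changing, final["s"])
```

On this network the simulation modifies entries in exactly two iterations, matching the claim that the process converges after two iterations.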

Let us analyze the message and time costs of the associated protocol. In each iteration, an entity sends its distance vector containing costs and path information; actually, it is not necessary to send the entire path but only the first hop in it (see discussion in Section 4.4). In other words, in each iteration, an entity x needs to send n items of information to its deg(x) neighbors. Thus, in total, an iteration requires 2nm messages. As this process terminates after at most n − 1 iterations, we have

M[Iterated Construction] = 2 (n − 1) n m.    (4.2)
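As a worked instance of these bounds, assume the network of Figure 4.1 has n = 7 entities and m = 9 links (the values obtained by counting the finite entries of Table 4.3). The figures below are worst-case bounds, not the cost of the two-iteration execution shown in Tables 4.3–4.5.

```latex
\begin{align*}
\text{one iteration:}\quad & \sum_{x} n \deg(x) = 2nm = 2 \cdot 7 \cdot 9 = 126 \\
\text{Iterated Construction, bound (4.2):}\quad & 2(n-1)\,n\,m = 2 \cdot 6 \cdot 7 \cdot 9 = 756 \\
\text{Map Gossip, broadcast term } 2m(n-1):\quad & 2 \cdot 9 \cdot 6 = 108
\end{align*}
```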

That is, this approach is more expensive than the one based on constructing all the maps; it does, however, require less local storage. As for the time complexity, let τ(n) denote the amount of ideal time required to transmit n items of information to the same neighbor; then

T[Iterated Construction] = (n − 1) τ(n).    (4.3)

Clearly, if the system allows very long messages, the protocol can be executed with fewer messages. In particular, if messages containing O(n) items of information (instead of O(1)) are possible, then in each iteration an entity can transmit its entire distance vector to a neighbor with just one message, and τ(n) = 1. The entire process can thus be accomplished with O(n m) messages, and the time complexity would then be just n − 1.

4.2.3 Constructing Shortest-Path Spanning Tree

The first solution we have seen, protocol Map Gossip, requires that each entity has locally available enough storage to store the entire map of the network. The second solution, protocol Iterated Construction, avoids this problem, but it does so at the expense of a substantially increased number of messages. Our goal is to design a protocol that, without increasing the local storage requirements, constructs the routing tables with a smaller amount of communication. Fortunately, there is an important property that will help us in achieving this goal.


Consider the paths contained in the full routing table RT(s) of an entity s, for example, the ones in Table 4.1. These paths define a subgraph of the network (as not every link is included). This subgraph is special: It is connected, contains all the nodes, and does not have cycles (see Figure 4.1, where the subgraph links are in bold); in other words, it is a spanning tree! It is called the shortest path spanning tree rooted in s, denoted PT(s), sometimes also known as the sink tree of s. This fact is important because it tells us that, to construct the routing table RT(s) of s, we just need to construct the shortest path spanning tree PT(s).

Protocol Design

To construct the shortest path spanning tree PT(s), we can adapt a classical serial strategy for constructing PT(s) starting from the source s:

Serial Strategy
We are given a connected fragment T of PT(s), containing s (initially, T will be composed of just s). Consider now all the links going outside of T (i.e., to nodes not yet in T). To each such link (x, y) associate the value v(x, y) = γs(x) + θ(x, y), that is, v(x, y) is the cost of reaching y from the source s by first going to x (through a shortest path) and then using the link (x, y) to reach y. Add to T the link (a, b) for which v(a, b) is minimum; in case of a tie, choose the one leading to the node with the lexicographically smallest name.

This strategy works because of the following property:

Property 4.2.2 Let T and (a, b) be as defined in the serial strategy. Then T ∪ (a, b) is a connected fragment of PT(s).

That is, the new tree, obtained by adding the chosen (a, b) to T, is also a connected fragment of PT(s), containing s, and it is clearly larger than T. In other words, using this strategy, the shortest path spanning tree PT(s) will be constructed, starting from s, by adding the appropriate links, one at a time.

The algorithm based on this strategy will be a sequence of iterations started from the root. In each iteration, the outgoing link (a, b) with minimum cost v(a, b) is chosen; the link (a, b) and the node b are added to the fragment, and a new iteration is started. The process terminates when the fragment includes all the nodes. Our goal is now to implement this algorithm efficiently in a distributed way.

First of all, let us consider what a node y in the fragment T knows. Definitely y knows which of its links are part of the current fragment; it also knows the length γs(y) of the shortest path from the source s to it.
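The serial strategy described above can be sketched in a few lines. The sketch is centralized and illustrative only: the function name `serial_pt` is hypothetical, and the link costs are assumed to be those of the network of Figure 4.1 as listed in Table 4.3.

```python
# Assumed link costs of the network of Figure 4.1 (taken from Table 4.3).
COSTS = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def build_map(costs):
    """net[x][y] is the cost of link (x, y); links are bidirectional."""
    net = {}
    for (x, y), c in costs.items():
        net.setdefault(x, {})[y] = c
        net.setdefault(y, {})[x] = c
    return net

def serial_pt(net, source):
    """Grow the fragment T one link at a time, as in the serial strategy.

    gamma[x] is the cost of the shortest path from the source to x for
    every node x already in the fragment.
    """
    gamma = {source: 0}
    tree = []
    while len(gamma) < len(net):
        # v(x, y) = gamma[x] + theta(x, y) for every link leaving the fragment
        outgoing = [
            (gamma[x] + cost, y, x)
            for x in gamma
            for y, cost in net[x].items()
            if y not in gamma
        ]
        # minimum value; ties go to the smallest destination name
        v, b, a = min(outgoing)
        gamma[b] = v
        tree.append((a, b))
    return tree, gamma

if __name__ == "__main__":
    tree, gamma = serial_pt(build_map(COSTS), "s")
    print(tree)
    print(gamma)
```

With these costs the links are added in the order (s, h), (h, k), (s, e), (e, f), (s, c), (c, d), which are exactly the links used by the paths of Table 4.1.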


IMPORTANT. Let us assume for the moment that y also knows which of its links are outgoing (i.e., lead to nodes outside of the current fragment) and which are internal. In this case, to ﬁnd the outgoing link (a, b) with minimum cost v (a, b) is rather simple, and the entire iteration is composed of four easy steps: Iteration 1. The root s broadcasts in T the start of the new iteration. 2. Upon receiving the start, each entity x in the current fragment T computes locally v(x, y)= γs (x) + θ (x, y) for each of its outgoing incident links (x, y); it then selects among them the link e = (x, y ) for which v(x, y ) is minimized. 3. The overall minimum v(a, b) among all the locally selected v(e)’s is computed at s, using a minimum-ﬁnding for (rooted) trees (e.g., see Section 2.6.7), and the corresponding link (a, b) is chosen as the one to be added to the fragment. 4. The root s notiﬁes b of the selection; the link (a, b) is added to the spanning-tree; b computes γs (b), and s is notiﬁed of the end of the iteration. Each iteration can be performed efﬁciently, in O(n) messages, as each operation (broadcast, min-ﬁnding, notiﬁcations) is performed on a tree of at most n nodes. There are a couple of problems that need to be addressed. A small problem is how can b compute γs (b). This value is actually determined at s by the algorithm in this iteration; hence, s can communicate it to b when notifying it of its selection. A more difﬁcult problem regards the knowledge of which links are outgoing (i.e., they lead to nodes outside of the current fragment); we have assumed that an entity in T has such a knowledge about its links. But how can such a knowledge be ensured? As described, during an iteration, messages are sent only on the links of T and on the link selected in that iteration. This means that the outgoing links are all unexplored (i.e., no message has been sent or received on them). 
As we do not know which are outgoing, an entity could perform the computation of step 2 for each of its unexplored incident links and select the minimum among those. Consider for example the graph of Figure 4.2(a) and assume that we have already constructed the fragment shown in Figure 4.2(b). There are four unexplored links incident to the fragment (shown as leading to square boxes), each with its value (shown in the corresponding square box); the link (s, e) among them has minimum value and is chosen; it is outgoing and it is added to the segment. The new segment is shown in Figure 4.2(c) together with the unexplored links incident on it. However, not all unexplored links are outgoing: An unexplored link might be internal (i.e., leading to a node already in the fragment), and selecting such a link would be an error. For example, in Figure 4.2(c), the unexplored link (e, k) has value v(e, k) = 7, which is minimum among the unexplored edges incident on the fragment, and hence would be chosen; however, node e is already in the fragment. We could allow for errors: We choose among the unexplored links and, if the link (in our example: (e, k)) selected by the root s in step 3 turns out to be internal

FIGURE 4.2: Determining the next link to be added to the fragment.

(k would find out in step 4 when the notification arrives), we eliminate that link from consideration and select another one. The drawback of this approach is its overall cost. In fact, since initially all links are unexplored, we might have to perform the entire selection process for every link. This means that the cost will be O(nm), which in the worst case is O(n^3): a high price to construct a single routing table. A more efficient approach is to add a mechanism so that no error will occur. Fortunately, this can be achieved simply and efficiently as follows. When a node b becomes part of the tree, it sends a message to all its neighbors notifying them that it is now part of the tree. Upon receiving such a message, a neighbor c knows that this link must no longer be used when performing shortest path calculations for the tree. As a side effect, in our example, when the link (s, e) is chosen in Figure 4.2(b), node e already knows that the link (e, k) leads to a node already in the fragment; thus such a link is not considered, as shown in Figure 4.2(d).

RECALL. We have used a similar strategy with the protocol for depth-first traversal, to decrease its time complexity.

IMPORTANT. It is necessary for b to ensure that all its neighbors have received its message before a new iteration is started. Otherwise, due to time delays, a neighbor


c might receive the request to compute the minimum for the next iteration before the message from b has even arrived; thus, it is possible that c (not knowing yet that b is part of the tree) chooses its link to b as its minimum, and such a choice is selected as the overall minimum by the root s. In other words, it is still possible that an internal link is selected during an iteration. Summarizing, to avoid mistakes, it is sufficient to modify rule 4 as follows:

4. The root s sends an Expand message to b and the link (a, b) is added to the spanning tree; b computes γ_s(b), sends a notification to its neighbors, waits for their acknowledgment, and then notifies s of the end of the iteration.

This ensures that there will be only n − 1 iterations, each adding a new node to the spanning tree, with a total cost of O(n^2) messages. Clearly we must also consider the cost of each node notifying its neighbors (and them sending acknowledgments), but this adds only O(m) messages in total. The protocol, called PT Construction, is shown in Figures 4.3–4.6.

Analysis

Let us now analyze the cost of protocol PT Construction in detail. There are two basic activities being performed: the expansion of the current fragment of the tree and the announcement (with acknowledgments) of the addition of the new node to the fragment. Let us consider the expansion first. It consists of a “start-up” (the root broadcasting the Start Iteration message), a “convergecast” (the minimum value is collected at the root using the MinValue messages), and two “notifications” (the root notifies the new node using the Expand message, and the new node notifies the root using the Iteration Completed message). Each of these operations is performed on the current fragment, which is a tree rooted in the source.
In particular, the start-up and the convergecast operations each cost only one message on every link; in the notifications, messages are sent only on the links in the path from the source to the new node, and there will be only one message in each direction. Thus, in total, on each link of the tree constructed so far, there will be at most four messages due to the expansion; two messages will also be sent on the new link added in this expansion. Thus, in the expansion at iteration i, at most 4(n_i − 1) + 2 messages will be sent, where n_i is the size of the current tree. As the tree is expanded by one node at a time, n_i = i. In fact, initially there is only the source; then the fragment is composed of the source and a neighbor, and so on. Thus, the total number of messages due to the expansion is

\[ \sum_{i=1}^{n-1} \bigl(4(n_i - 1) + 2\bigr) = \sum_{i=1}^{n-1} (4i - 2) = 2n(n-1) - 2(n-1) = 2n^2 - 4n + 2. \]
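The closed form above can be checked numerically. The following is only a small sketch: the function names are hypothetical, and the per-iteration term is taken directly from the bound 4(n_i − 1) + 2 with n_i = i.

```python
def expansion_messages(n: int) -> int:
    """Add up the per-iteration bound 4(n_i - 1) + 2, with n_i = i, for i = 1..n-1."""
    return sum(4 * (i - 1) + 2 for i in range(1, n))


def expansion_closed_form(n: int) -> int:
    """Closed form 2n^2 - 4n + 2 stated in the text."""
    return 2 * n * n - 4 * n + 2


# The two expressions agree for every network size tried here.
for n in (2, 5, 10, 100):
    assert expansion_messages(n) == expansion_closed_form(n)
```

Note that the closed form equals 2(n − 1)^2, which makes the quadratic growth of the expansion cost explicit.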

The cost due to announcements and acknowledgments is simple to calculate: Each node will send a Notify message to all its neighbors when it becomes part of the tree


PROTOCOL PT Construction.

States: S = {INITIATOR, IDLE, AWAKE, ACTIVE, WAITING FOR ACK, COMPUTING, DONE};
        S_INIT = {INITIATOR, IDLE};
        S_TERM = {DONE}.
Restrictions: IR; UI.

INITIATOR
    Spontaneously
    begin
        source := true;
        my_distance := 0;
        ackcount := |N(x)|;
        send(Notify) to N(x);
    end

    Receiving(Ack)
    begin
        ackcount := ackcount - 1;
        if ackcount = 0 then
            iteration := 1;
            v(x, y) := MIN{v(x, z) : z ∈ N(x)};
            path_length := v(x, y);
            Children := {y};
            send(Expand, iteration, path_length) to y;
            Unvisited := N(x) − {y};
            become ACTIVE;
        endif
    end

IDLE
    Receiving(Notify)
    begin
        Unvisited := N(x) − {sender};
        send(Ack) to sender;
        become AWAKE;
    end

AWAKE
    Receiving(Expand, iteration*, path_value*)
    begin
        my_distance := path_value*;
        parent := sender;
        Children := ∅;
        if |N(x)| > 1 then
            send(Notify) to N(x) − {sender};
            ackcount := |N(x)| − 1;
            become WAITING FOR ACK;
        else
            send(Iteration Completed) to parent;
            become ACTIVE;
        endif
    end

FIGURE 4.3: Protocol PT-Construction (I)


AWAKE
    Receiving(Notify)
    begin
        Unvisited := Unvisited − {sender};
        send(Ack) to sender;
    end

WAITING FOR ACK
    Receiving(Ack)
    begin
        ackcount := ackcount - 1;
        if ackcount = 0 then
            send(Iteration Completed) to parent;
            become ACTIVE;
        endif
    end

ACTIVE
    Receiving(Iteration Completed)
    begin
        if not(source) then
            send(Iteration Completed) to parent;
        else
            iteration := iteration + 1;
            send(Start Iteration, iteration) to Children;
            Compute_Local_Minimum;
            childcount := 0;
            become COMPUTING;
        endif
    end

    Receiving(Start Iteration, iteration*)
    begin
        iteration := iteration*;
        Compute_Local_Minimum;
        if Children = ∅ then
            send(MinValue, minpath) to parent;
        else
            send(Start Iteration, iteration) to Children;
            childcount := 0;
            become COMPUTING;
        endif
    end

FIGURE 4.4: Protocol PT-Construction (II)

and receives an Ack from each of them. Thus, the total number of messages due to the notifications is

\[ 2 \sum_{x \in V} |N(x)| = 2 \sum_{x \in V} \deg(x) = 4m. \]

To complete the analysis, we need to consider the final broadcast of the Terminate message, which is performed on the constructed tree; this will add n − 1 messages to the total, yielding the following:

\[ M[\text{PT Construction}] \le 2n^2 + 4m - 3n + 1 \tag{4.4} \]
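To make the fragment-growing strategy concrete, here is a minimal centralized sketch of what the protocol computes; it is not the distributed protocol itself. The graph encoding, the function names, and the four-node example network are assumptions made only for illustration. Each pass of the loop plays the role of one iteration: every node of the fragment proposes its cheapest outgoing link, and the overall minimum is added to the tree.

```python
import math


def pt_construction(graph, s):
    """Grow the shortest-path tree of s by one link per iteration.

    graph: dict mapping node -> {neighbor: non-negative link cost} (hypothetical encoding).
    Returns (dist, parent) for every node reachable from s.
    """
    dist = {s: 0}        # gamma_s(x) for the nodes already in the fragment
    parent = {s: None}
    while True:
        best_value, best_link = math.inf, None
        # Steps 2 and 3: each fragment node proposes its cheapest outgoing
        # link, and the overall minimum v(a, b) is retained.
        for x in dist:
            for y, cost in graph[x].items():
                if y not in dist and dist[x] + cost < best_value:
                    best_value, best_link = dist[x] + cost, (x, y)
        if best_link is None:   # no outgoing link is left: the tree is complete
            return dist, parent
        a, b = best_link        # step 4: add (a, b) and record gamma_s(b)
        dist[b], parent[b] = best_value, a


def message_bound(n, m):
    """Right-hand side of bound (4.4)."""
    return 2 * n * n + 4 * m - 3 * n + 1


# Hypothetical network with n = 4 nodes and m = 5 links.
example = {
    "s": {"a": 1, "b": 4},
    "a": {"s": 1, "b": 2, "c": 6},
    "b": {"s": 4, "a": 2, "c": 3},
    "c": {"a": 6, "b": 3},
}
dist, parent = pt_construction(example, "s")
```

In this example the tree is built in the order a, b, c, and the bound (4.4) evaluates to 41 messages for n = 4 and m = 5.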

ACTIVE
    Receiving(Expand, iteration*, path_value*)
    begin
        send(Expand, iteration*, path_value*) to exit;
        if exit = mychoice then
            Children := Children ∪ {mychoice};
            Unvisited := Unvisited − {mychoice};
        endif
    end

    Receiving(Notify)
    begin
        Unvisited := Unvisited − {sender};
        send(Ack) to sender;
    end

    Receiving(Terminate)
    begin
        send(Terminate) to Children;
        become DONE;
    end

COMPUTING
    Receiving(MinValue, path_value*)
    begin
        if path_value* < minpath then
            minpath := path_value*;
            exit := sender;
        endif
        childcount := childcount + 1;
        if childcount = |Children| then
            if not(source) then
                send(MinValue, minpath) to parent;
                become ACTIVE;
            else
                Check_for_Termination;
            endif
        endif
    end

FIGURE 4.5: Protocol PT Construction (III)

By adding a little bookkeeping, the protocol can be used to construct the routing table RT(s) of the source (Exercise 4.6.13). Hence, we have a protocol that constructs the routing table of a node using O(n^2) messages. We will see later how more efficient solutions can be derived for the special case when all the links have the same cost (or, alternatively, there is no cost on the links). Note that we have made no assumptions other than that the costs are non-negative; in particular, we did not assume first in first out (FIFO) channels (i.e., message ordering).

4.2.4 Constructing All-Pairs Shortest Paths

Protocol PT Construction allows us to construct the shortest-path tree of a node, and thus to construct the routing table of that entity. To solve the original problem of constructing all the routing tables, also known as all-pairs shortest-paths construction,


Procedure Check_for_Termination
begin
    if minpath = inf then
        send(Terminate) to Children;
        become DONE;
    else
        send(Expand, iteration, minpath) to exit;
        become ACTIVE;
    endif
end

Procedure Compute_Local_Minimum
begin
    if Unvisited = ∅ then
        minpath := inf;
    else
        link_length := v(x, y) = MIN{v(x, z) : z ∈ Unvisited};
        minpath := my_distance + link_length;
        mychoice := exit := y;
    endif
end

FIGURE 4.6: Procedures used by protocol PT Construction

this process must be repeated for all nodes. The complexity of the resulting protocol PT All follows immediately from equation (4.4), multiplying its bound by n:

\[ M[\text{PT All}] \le 2n^3 - 3n^2 + (4m + 1)n \tag{4.5} \]

The costs of protocols Map Gossip, Iterative Construction, and PT All are shown in Figure 4.7. Definitely better than protocol Iterative Construction, protocol PT All matches the worst case cost of Map Gossip without requiring large amounts of local storage. Hence, it is an efficient solution. It is clear that some information computed when constructing PT(x) can be reused in the construction of PT(y). For example, the shortest path from x to y is just the reverse of the one from y to x (under the bidirectional links assumption we are using); hence, we just need to determine one of them. Even stronger is the so-called optimality principle:

Property 4.2.3 If a node x is in the shortest path π from a to b, then π is also a fragment of PT(x).

Hence, once a shortest path π has been computed for the shortest path tree of an entity, this path can be added to the shortest path tree of all the entities in the path.

Algorithm                Cost             Restrictions
Map Gossip               O(nm)            Ω(m) local storage
Iterative Construction   O(n^2 m)
PT All                   O(n^3)
SparserGossip            O(n^2 log n)

FIGURE 4.7: Constructing all shortest path routing tables.

So, in the example of Figure 4.1, the path (s, e)(e, f) in PT(s) will also be a part of


PT(e) and PT(f). However, to date, it is not clear how this fact can be used to derive a more efficient protocol for constructing all the routing tables.

Constructing a Sparser Subgraph

Interestingly, the number of messages can be brought down from O(n^3) to O(n^2 log n) not by cleverly exploiting information but rather by cleverly constructing a spanning subgraph of the network, called a sparser, and then simulating the execution of Map Gossip on it. To understand this subgraph, we need some terminology. Given a subset V' ⊆ V of the nodes, we call the eccentricity of x ∈ V' in V' its largest distance from the other nodes of V', that is, r(x, V') = max_{y∈V'} {d_G(x, y)}; then r(V') = max_{x∈V'} {r(x, V')} is called the radius of V'. The density of x ∈ V' in V' instead is the number of its neighbors that are in V', that is, den(x, V') = |N(x) ∩ V'|; the density of V' is the sum of the densities of all its nodes: den(V') = Σ_{x∈V'} den(x, V'). Given a collection A of subsets of the nodes, the radius r(A) of A will be just the largest among the radii of those subsets; the density den(A) will be just the sum of the densities of those subsets. An (a, b)-sparser S is just a partition of the set V of nodes into subsets such that its radius is r(S) = a and its density is den(S) = b.

The basic idea is to first of all
1. construct a sparser S = {V_1, ..., V_k};
2. elect a leader x_i in each of its sets V_i;
3. establish a path connecting the two leaders of each pair of neighboring subsets.

Then the execution of the protocol in G is simulated in the sparser. What this means is that
4. each leader executes the algorithm for each node in its subset;
5. whenever in the algorithm a message is sent from a node in V_i to a node in V_j, the message is sent by x_i to x_j.
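The radius and density of a candidate partition can be evaluated directly from the definitions. The sketch below is only illustrative: it assumes hop-count distances d_G computed by breadth-first search in a connected graph, an adjacency encoding as sets of neighbors, and hypothetical function names; the six-node path used as an example is not taken from the text.

```python
from collections import deque


def hop_distances(adj, src):
    """Hop distances d_G(src, .) in the whole graph, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def radius(adj, subset):
    """r(V'): the largest eccentricity r(x, V') over the nodes x of the subset."""
    best = 0
    for x in subset:
        dist = hop_distances(adj, x)
        best = max(best, max(dist[y] for y in subset))
    return best


def density(adj, subset):
    """den(V'): sum over x in V' of the number of neighbors of x inside V'."""
    return sum(len(adj[x] & subset) for x in subset)


def sparser_parameters(adj, partition):
    """Return (r(S), den(S)) for a partition S given as a list of node sets."""
    return (max(radius(adj, part) for part in partition),
            sum(density(adj, part) for part in partition))


# Hypothetical example: the path 1-2-3-4-5-6 split into two halves.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
params = sparser_parameters(path, [{1, 2, 3}, {4, 5, 6}])
```

For this partition the sketch yields radius 2 and density 8, so under the definitions above it would be a (2, 8)-sparser of the path.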
An interesting consequence of (5) above is that the cost of a node u sending a message to all its neighbors, when simulated in the sparser, will depend on the number of subsets in which u has neighbors as well as on the distance between the corresponding leaders. This means that for the simulation to be efficient, the radius should be small, r(S) = O(log n), and the density at most linear, den(S) = O(n). Fortunately we have (Exercise 4.6.15):

Property 4.2.4 Any connected graph G of n nodes has a (log n, n)-sparser.

The existence of this good sparser is not enough; we must be able to construct it with a reasonable amount of messages. Fortunately, this is also possible (Exercise


4.6.16). When constructing it, there are several important details that must be taken care of; in particular, the paths between the centers must be uniquely determined. Once all of this is done, we must then define the set of rules (Exercise 4.6.17) to simulate protocol Map Gossip. At this point, the resulting protocol, called SparserGossip, yields the desired performance

\[ M[\text{SparserGossip}] = O(n^2 \log n). \tag{4.6} \]

Using Long Messages

In systems that allow very long messages, not surprisingly the problem can be solved with fewer messages. For example, if messages can contain O(n) items of information (instead of O(1)), all the shortest path trees can be constructed with just O(n^2) messages (Exercise 4.6.18). If messages can contain O(n^2) items, then any graph problem including the construction of all shortest path trees can be solved using O(n) messages once a leader has been elected (requiring Ω(m + n log n) normal messages). A summary of all these results is shown in Figure 4.7.

4.2.5 Min-Hop Routing

Consider the case when all links have the same cost (or alternatively, there are no costs associated to the links), that is, θ(a, b) = θ for all (a, b) ∈ E. This case is special in several respects. In particular, observe that the shortest path from a to b will have cost γ_a(b) = θ d_G(a, b), where d_G(a, b) is the distance (in number of hops) of a from b in G; in other words, the cost of a path will depend solely on the number of hops (i.e., the number of links) in that path. Hence, the shortest path between two nodes will be the one with minimum hops. For these reasons, routing in this situation is called min-hop routing. An interesting consequence is that the shortest path spanning tree of a node coincides with its breadth-first spanning tree. In other words, a breadth-first spanning tree rooted in a node is the shortest path spanning tree of that node when all links have the same cost. Protocol PT Construction works for any choice of the costs, provided they are non-negative; so it constructs a breadth-first spanning tree if all the costs are the same. However, we can take advantage of the fact that all links have the same costs to obtain a more efficient protocol. Let us see how.

Breadth-First Spanning-Tree Construction

Without any loss of generality, let us assume that θ = 1; thus, γ_s(a) = d_G(s, a).
We can use the same strategy of protocol PT Construction of starting from s and successively expanding the fragment; only, instead of choosing one link (and thus one node) at a time, we can choose several simultaneously: In the first step, s chooses all the nodes at distance 1 (its neighbors); in the second step, s chooses simultaneously all the nodes at distance 2; in general, in step i, s chooses simultaneously all the nodes at distance i; notice that before step i, none of the nodes at distance i was a part of the


fragment. Clearly, the problem is to determine, in step i, which nodes are at distance i from s. Observe this very interesting property: All the neighbors of s are at distance 1 from s; all their neighbors (not at distance 1 from s) are at distance 2 from s; in general,

Property 4.2.5 If a node is at distance i from s, then its neighbors are at distance either i − 1 or i or i + 1 from s.

This means that once the nodes at distance i from s have been chosen (and become part of the fragment), we need to consider only their neighbors to determine which nodes are at distance i + 1. So the protocol, which we shall call BF, is rather simple. Initially, the root s sends a “start iteration 1” message to each neighbor indicating the first iteration of the algorithm and considers them its children. Each recipient marks its distance as 1, marks the sender as its parent, and sends an acknowledgment back to the parent. The tree is now composed of the root s and its neighbors, which are all at distance 1 from s. In general, after iteration i all the nodes at distance up to i are part of the tree. Furthermore, each node at distance i knows which of its neighbors are at distance i − 1 (Exercise 4.6.19). In iteration i + 1, the root broadcasts on the current tree a “start iteration i + 1” message. Once this message reaches a node x at distance i, it sends an “Explore i + 1” message to its neighbors that are not at distance i − 1 (recall, x knows which they are) and waits for a reply from each of them. These neighbors are either at distance i like x itself, or at i + 1; those at distance i are already in the tree and so do not need to be included. Those at distance i + 1 must be attached to the tree; however, each must be attached only once (otherwise we create a cycle and do not form a tree; see Figure 4.8). When a neighbor y receives the “Explore” message, the content of its reply will depend on whether or not y is already part of the tree.
If y is not part of the tree, it now knows that it is at distance i + 1 from s; it then marks the sender as its parent, sends a positive acknowledgment to it, and becomes part of the tree. If y is part of the tree (even if it just happened in this iteration), it will reply with a negative acknowledgment. When x receives the reply from y, if the reply is positive, it will mark y as a child; otherwise, it will mark y as already in the tree. Once all the replies have been received, it participates in a convergecast notifying the root that the iteration has been completed.

Cost

Let us now examine the cost of protocol BF. Denote by n_i the number of nodes at distance at most i from s. In each iteration, there are three operations involving communication: (1) the broadcast of “Start” on the tree constructed so far; (2) the sending of “Explore” messages by the nodes at distance i, and the corresponding replies; and (3) the convergecast to notify the root of the termination of the iteration. Consider first the cost of operation (2), that is, the cost of the “Explore” messages and the corresponding replies. Consider a node x at distance i. As already mentioned, its neighbors are at distance either i − 1 or i or i + 1. The neighbors at distance i − 1


FIGURE 4.8: Protocol BF expands an entire level in each iteration.

sent an “Explore” message to x in stage i − 1, so x sent a reply to each of them. In stage i, x sent an “Explore” message to all its other neighbors. Hence, in total, x sent just one message (either “Explore” or reply) to each of its neighbors. This means that in total, the number of “Explore” and “Reply” messages is

\[ \sum_{x \in V} |N(x)| = 2m. \]
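Both facts used in this count, Property 4.2.5 and the identity between the sum of the neighborhood sizes and 2m, are easy to check on a concrete graph. The sketch below only computes hop levels centrally; it does not simulate the message exchange of protocol BF. The adjacency encoding, the function names, and the five-node ring are assumptions for illustration, and the graph is assumed to be connected.

```python
from collections import deque


def bfs_levels(adj, s):
    """Hop distance from s to every node of a connected graph."""
    level = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in level:
                level[w] = level[u] + 1
                queue.append(w)
    return level


def satisfies_level_property(adj, s):
    """Property 4.2.5: the levels of two neighbors differ by at most one."""
    level = bfs_levels(adj, s)
    return all(abs(level[x] - level[y]) <= 1 for x in adj for y in adj[x])


def explore_and_reply_count(adj):
    """One message from each node to each of its neighbors: sum of |N(x)|."""
    return sum(len(neighbors) for neighbors in adj.values())


# Hypothetical example: a ring of five nodes (m = 5 links).
ring = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
levels = bfs_levels(ring, 0)
```

On the ring, the count returned by `explore_and_reply_count` is 10, that is, 2m for m = 5.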

We will consider now the overall cost of operations (1) and (3). In iteration i + 1, both broadcast and convergecast are performed on the tree constructed in iteration i, thus costing n_i − 1 messages each, for a total of 2n_i − 2 messages. Therefore, the total cost will be

\[ \sum_{1 \le i < \,\cdots} 2(n_i - 1), \]

[...] k > 0, x sends Explore(j + 1, k − 1) to all its neighbors except its parent. If k = 0, then a positive reply Positive(j) is sent to the parent y.
2. Let j > level_x. In this case, this is not a shorter path to x; x replies with a negative acknowledgment Negative(j).

When x receives a reply from its neighbor z:
1. If the level of the reply is (level_x + 1) then:
   (a) if the reply is Negative(level_x + 1), then x considers z a non-child.
   (b) if the reply is Positive(level_x + 1), then x considers z a child.
   (c) If, with this message, x has now received a reply with level (level_x + 1) from all its neighbors except its parent, then it sends Positive(level_x) to its parent.
2. If the level of the reply is not (level_x + 1) then the message is discarded.

FIGURE 4.10: Exploration phase of BF Levels: x is not part of the current fragment


Correctness

During the expansion phase all the nodes at distance at most t + l from the root are indeed reached, as can be easily verified (Exercise 4.6.23). Thus, to prove the correctness of the protocol, we need just to prove that those nodes will be attached to the existing fragment at the proper level. We will prove this by induction on the levels. First of all, all the nodes at level t + 1 are neighbors of the sources and thus each will receive at least one Explore(t + 1, l) message; when this happens, regardless of whatever has happened before, each will set its level to t + 1; as this is the smallest level that they can ever receive, their level will not change during the rest of the iteration. Let it be true for the nodes up to level t + k, 1 ≤ k ≤ l − 1; we will show that it also holds for the nodes in level t + k + 1. Let π be the path of length t + k + 1 from s to x and let u be the neighbor of x in this path; by definition, u is at level t + k and, by inductive hypothesis, it has correctly set level_u = t + k. When this happened, u sent a message Explore(t + k + 1, l − k − 1) to all its neighbors, except its parent. As x is clearly not u's parent, it will eventually receive this message; when this happens, x will correctly set level_x = t + k + 1. So we must show that the expansion phase will not terminate before x receives this message. Focus again on node u; it will not send a positive acknowledgment to its parent (and thus the phase cannot terminate) until it receives a reply from all its other neighbors, including x. As, to reply, x must first receive the message, x will correctly set its level during the phase.

Cost

To determine the cost of protocol BF Levels, we need to analyze the cost of the synchronization and of the expansion phases. The cost of a synchronization, as we discussed earlier, is at most 2(n − 1) messages, as both the initialization broadcast and the termination convergecast are performed on the currently available tree.
Hence, the total cost of all synchronization activities depends on the number of iterations. This quantity is easily determined. As there are radius(r) < d(G) levels, and we add l levels in every iteration, except in the last where we add the rest, the number of iterations is at most d(G)/l. This means that the total amount of messages due to synchronization is at most

\[ 2(n-1)\,\frac{d(G)}{l} \le \frac{2(n-1)^2}{l}. \tag{4.9} \]

Let us now analyze the cost of the expansion phase in iteration i, 1 ≤ i ≤ d(G)/l. Observe that in this phase, only the nodes in the levels L(i) = {(i − 1)l + 1, (i − 1)l + 2, . . . , il − 1, il} as well as the sources (i.e., the nodes at level (i − 1)l) will be involved, and messages will only be sent on the m_i links between them. The messages sent during this phase will be just Explore(t + 1, l), Explore(t + 2, l − 1), Explore(t + 3, l − 2), . . . , Explore(t + l, 0), and the corresponding replies will be Positive(j) or Negative(j), t + 1 ≤ j ≤ t + l. A node in one of the levels in L(i) sends to its neighbors at most one of each of those Explore messages; hence there will be on each edge at most 2l Explore messages (l in each direction), for a total of 2 l m_i. As for each Explore there is at most one reply, the total number of messages sent in this phase will be no more than 4 l m_i.


This fact, observing that the sets of links involved in the different iterations are disjoint, yields no more than

\[ \sum_{i=1}^{d(G)/l} 4\,l\,m_i = 4\,l\,m \tag{4.10} \]

messages for all the explorations of all iterations. Combining equations (4.9) and (4.10), we obtain

\[ M[\text{BF Levels}] \le \frac{2(n-1)\,d(G)}{l} + 4\,l\,m. \tag{4.11} \]

If we choose l = O(n/√m), expression (4.11) becomes

\[ M[\text{BF Levels}] = O(n\sqrt{m}). \]
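The trade-off in bound (4.11) between the synchronization term, which shrinks with l, and the exploration term, which grows with l, can be explored numerically. The sketch below simply evaluates the right-hand side of (4.11); the function names and the sample values n = 1000, m = 4000, d(G) = n − 1 are assumptions chosen only for illustration.

```python
import math


def bf_levels_message_bound(n, m, d, l):
    """Right-hand side of bound (4.11) for strips of l levels."""
    return 2 * (n - 1) * d / l + 4 * l * m


def best_strip_size(n, m, d):
    """Integer l in 1..d that minimizes the bound (brute-force search)."""
    return min(range(1, d + 1),
               key=lambda l: bf_levels_message_bound(n, m, d, l))


# Hypothetical worst-case instance: diameter as large as n - 1.
n, m = 1000, 4000
d = n - 1
l_star = best_strip_size(n, m, d)
bound_at_l_star = bf_levels_message_bound(n, m, d, l_star)
```

For these sample values the minimizing strip size is of the same order as n/√m, and the resulting bound stays within a small constant factor of n√m, in line with the O(n√m) estimate.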

This formula is quite interesting. In fact, it depends not only on n but also on the square root of the number m of links. If the network is sparse (i.e., it has O(n) links), then the protocol uses only O(n^1.5) messages; note that this occurs in any planar network. The worst case will be with very dense networks (i.e., m = O(n^2)). However, in this case the protocol will use at most O(n^2) messages, which is no more than protocol BF. In other words, protocol BF Levels will have the same cost as protocol BF only for very dense networks and will be much better in all other systems; in particular, whenever m = o(n^2), it uses a subquadratic number of messages. Let us consider now the ideal time costs of the protocol. Iteration i consists of reaching levels L(i) and returning to the root; hence the ideal time will be exactly 2il if 1 ≤ i < d(G)/l, and time 2d(G) in the last iteration. Thus, without considering the roundup, in total we have

\[ T[\text{BF Levels}] = \sum_{i=1}^{d(G)/l} 2li = \frac{d(G)^2}{l} + d(G). \tag{4.12} \]
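The step from the sum to the closed form in (4.12) can be spelled out as follows; the abbreviation k = d(G)/l for the number of iterations is introduced here only for readability and, as in the text, the roundup is ignored.

```latex
\begin{align*}
\sum_{i=1}^{k} 2li &= 2l \cdot \frac{k(k+1)}{2} = l\,k\,(k+1),
  \qquad k = \frac{d(G)}{l},\\
l\,k\,(k+1) &= l \cdot \frac{d(G)}{l}\left(\frac{d(G)}{l} + 1\right)
  = \frac{d(G)^2}{l} + d(G).
\end{align*}
```

As a sanity check, l = d(G) gives k = 1 and a total of 2d(G), while l = 1 gives d(G)^2 + d(G).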

The choice l = O(n/√m) we considered when counting the messages will give T[BF Levels] = O(d(G)^2 √m / n),


TABLE 4.6: Summary: Costs of Constructing a Breadth-first Tree

Network   Algorithm   Messages     Time
General   BF          O(m + nd)    O(d^2)
General   BF Levels   O(n √m)      O(d^2 √m / n + d)
Planar    BF Levels   O(n^1.5)     O(d^2 / √n + d)

which, again, is the same ideal time as protocol BF only for very dense networks, and less in all other systems.

Reducing Time with More Messages

If time is of paramount importance, better results can be obtained at the cost of more messages. For example, if in protocol BF Levels we were to choose l = d(G), we would obtain an optimal time cost: T[BF Levels] = 2d(G).

IMPORTANT. We measure ideal time considering a synchronous execution where the communication delays are just one unit of time. In such an execution, when l = d(G), the number of messages will be exactly 2m + n − 1 (Exercise 4.6.25). In other words, in this synchronous execution, the protocol has optimal message costs. However, this is not the message complexity of the protocol, just the cost of that particular execution. To measure the message complexity we must consider all possible executions. Remember that to measure ideal time we consider only synchronous executions, while to measure message costs we must look at all possible executions, both synchronous and asynchronous (and choose the worst one).

The cost in messages choosing l = d(G) is given by expression (4.11), which becomes O(m d(G)). This quantity is reasonable only for networks of small degree. By the way, a priori knowledge of d(G) is not necessary to obtain these bounds (either time or messages; Exercise 4.6.24). If we are willing to settle for a low but suboptimal time, it is possible to achieve it with a better message complexity. Let us see how. In protocol BF Levels the network (and thus the tree) is viewed as divided into “strips,” each containing l levels of the tree. See Figure 4.11. The way the protocol works right now, in the expansion phase, each source (i.e., each leaf of the existing tree) constructs its own bf-tree over the nodes in the next l levels. These bf-trees have differential growth rates, some growing quickly, some slowly.
Thus, it is possible for a quickly growing bf-tree to have processed many more levels than a slower bf-tree. Whenever there are conflicts due to transmission delays (e.g., the arrival of a message with a better level) or concurrency (e.g., the arrival of another message with the same level), these conflicts are resolved, either


FIGURE 4.11: We need more efficient expansion of l levels in each iteration.

by “throwing away” everything already done and joining the new tree or sending a negative reply. It is the amount of work performed to take care of these conflicts that drives the costs of the protocol up. For example, when a node joins a bf-tree and has a (new) parent, it must send out messages to all its other neighbors; thus, if a node has a high degree and frequently changes trees, these adjacent edge messages dominate the communication complexity. Clearly, the problem is how to perform these operations efficiently. Conflicts and overlap occurring during the constructions of those different bf-trees in the l levels can be reduced by organizing the sources into clusters and coordinating the actions of the sources that are in the same cluster, as well as coordinating the different clusters. This in turn requires that the sources in the same cluster be connected so as to minimize the communication costs among them. The connection through a tree is the obvious option and is called a cover tree. To avoid conflicts, we want the cover trees of different clusters to have no edges in common. So we will have a forest of cover trees, which we will call the cover of all the sources. To coordinate the different clusters in the cover, we must be able to reach all sources; this, however, can already be done using the current fragment (recall, the sources are the leaves of the fragment). The message costs of the expansion phase will grow with the number of different clusters competing for the same node (the so-called load factor); on the contrary, the time costs will grow with the depth of the cover trees (the so-called depth factor). Notice that it is possible to obtain tradeoffs between the load factor and the depth factor by varying the size of the cover (i.e., the number of trees in the forest); for example, increasing the size of the forest reduces the depth factor while increasing the load factor.
We are thus faced with the problem of constructing clusters with a small amount of competition and shallow cover trees. Achieving this goal yields a time cost of O(d^{1+ε}) and a message cost of O(m^{1+ε}) for any fixed ε > 0. See Exercise 4.6.26.


4.2.6 Suboptimal Solutions: Routing Trees

Up to now, we have considered only shortest-path routing, that is, we have been looking at systems that always route a message to its destination through the shortest path. We will call such mechanisms optimal. To construct optimal routing mechanisms, we had to construct n shortest path trees, one for each node in the network, a task that we have seen is quite communication expensive. In some cases, the shortest path requirement is important but not crucial; actually, in many systems, guarantee of delivery with few communication activities is the only requirement. If the shortest path requirement is relaxed or even dropped, the problem of constructing a routing mechanism (tables and forwarding scheme) becomes simpler and can be achieved quite efficiently. Because they do not guarantee shortest paths, such solutions are called suboptimal. Clearly there are many possibilities depending on what (suboptimal) requirements the routing mechanism must satisfy. A particular class of solutions is the one using a single spanning tree of the network for all the routing, which we shall call a routing tree. The advantages of such an approach are obvious: We need to construct just one tree. Delivery is guaranteed and no more than diam(T) messages will be used on the tree T. Depending on which tree is used, we have different solutions. Let us examine a few.

Center-Based Routing. As the maximum number of messages used to deliver a message is at most diam(T), a natural choice for a routing tree is a spanning tree with a small diameter. One such tree is the shortest path tree rooted in a center of the network. In fact, let c be a center of G (i.e., a node where the maximum distance is minimized) and let PT(c) be the shortest path tree of c. Then (Exercise 4.6.27), diam(G) ≤ diam(PT(c)) ≤ 2 diam(G). To construct such a tree, we need first of all to determine a center c and then construct PT(c), for example, using protocol PT Construction.
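The diameter bound for the center-based tree can be observed on a small example. The following is a centralized sketch under simplifying assumptions: all links have unit cost (so the shortest path tree of c is a breadth-first tree), the graph is connected, and the function names and the six-node ring are hypothetical.

```python
from collections import deque


def bfs(adj, src):
    """Return (hop distances, breadth-first parents) from src."""
    dist, parent = {src: 0}, {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w], parent[w] = dist[u] + 1, u
                queue.append(w)
    return dist, parent


def diameter(adj):
    """Largest hop distance between any two nodes."""
    return max(max(bfs(adj, x)[0].values()) for x in adj)


def center_tree(adj):
    """Pick a center c (minimum eccentricity) and build PT(c) as an adjacency map."""
    eccentricity = {x: max(bfs(adj, x)[0].values()) for x in adj}
    c = min(eccentricity, key=eccentricity.get)
    _, parent = bfs(adj, c)
    tree = {x: set() for x in adj}
    for x, p in parent.items():
        if p is not None:
            tree[x].add(p)
            tree[p].add(x)
    return c, tree


# Hypothetical example: a ring of six nodes.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
c, tree = center_tree(ring)
```

On the ring, diam(G) = 3 while the tree rooted in the chosen center is a path with diameter 5, which lies between diam(G) and 2 diam(G) as stated.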
Median-Based Routing. Once we choose a tree T, an edge e = (x, y) of T linking the subtree T[x − y] to the subtree T[y − x] will be used every time a node in T[x − y] wants to send a message to a node in T[y − x], and vice versa (see Figure 4.12), where each use costs θ(e). Thus, assuming that overall every node generates the same amount of messages for every other node and all nodes overall generate the same amount of messages, the cost of using T for routing all this traffic is

$$\mathrm{Traffic}(T) = \sum_{(x,y) \in T} |T[x-y]| \; |T[y-x]| \; \theta(x, y).$$
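As a hedged illustration of this formula, the sketch below evaluates Traffic(T) edge by edge on a small weighted tree and, as a sanity check, also sums the tree distances over all unordered pairs of nodes by brute force. The tree, its costs, and the function names are hypothetical; node names are assumed to be comparable so that each edge is visited once.

```python
from itertools import combinations

def side_size(tree, x, y):
    """Number of nodes on x's side once the tree edge (x, y) is removed."""
    seen, stack = {x}, [x]
    while stack:
        u = stack.pop()
        for v in tree[u]:
            if v not in seen and not (u == x and v == y):
                seen.add(v)
                stack.append(v)
    return len(seen)

def traffic(tree):
    """Sum over tree edges of |T[x-y]| * |T[y-x]| * cost(x, y)."""
    n, total = len(tree), 0
    for x in tree:
        for y, cost in tree[x].items():
            if x < y:  # visit each undirected edge once
                s = side_size(tree, x, y)
                total += s * (n - s) * cost
    return total

def tree_distance(tree, a, b):
    """Cost of the unique path between a and b in the tree."""
    stack = [(a, None, 0)]
    while stack:
        u, prev, d = stack.pop()
        if u == b:
            return d
        for v, cost in tree[u].items():
            if v != prev:
                stack.append((v, u, d + cost))

def sum_of_distances(tree):
    return sum(tree_distance(tree, a, b) for a, b in combinations(tree, 2))

# Hypothetical tree: node -> {neighbor: link cost}
T = {0: {1: 2}, 1: {0: 2, 2: 1, 3: 4}, 2: {1: 1}, 3: {1: 4}}
```

Here traffic(T) and sum_of_distances(T) can be compared directly; on this tree both evaluate to 21.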

It is not difficult to see that such a measure is exactly the sum of all distances between nodes (Exercise 4.6.28). Hence, the best tree T to use is one that minimizes the sum of all distances between nodes.

FIGURE 4.12: The message traffic between the two subtrees T[x − y] and T[y − x] passes through edge e = (x, y).

Unfortunately, to construct the minimum-sum-distance spanning tree of a network is not simple. In fact, the problem is NP-hard. Fortunately, it is not difficult to construct a near-optimal solution. In fact, let z be a median of the network (i.e., a node for which the sum of distances $\mathrm{SumDist}(z) = \sum_{v \in V} d_G(v, z)$ to all other nodes is minimized) and let PT(z) be the shortest path tree of z. If T is the spanning tree that minimizes traffic, then (Exercise 4.6.29) Traffic(PT(z)) ≤ 2 Traffic(T). Thus, to construct such a tree, we need first of all to determine a median z and then construct PT(z), for example, using protocol PT Construction.

Minimum-Cost Spanning-Tree Routing. A natural choice for a routing tree is a minimum-cost spanning tree (MST) of the network. The construction of such a tree can be done, for example, using protocol MegaMerger discussed in Chapter 3.

All the solutions above have different advantages; for example, the center-based one offers the best worst-case cost, while the median-based one has the best average cost. Depending on the nature of the systems and of the applications, each might be preferable to the others.

There are also other measures that can be used to evaluate a routing tree. For example, a common measure is the so-called stretch factor σ_G(T) of a spanning tree T of G, defined as

$$\sigma_G(T) = \max_{x,y \in V} \frac{d_T(x, y)}{d_G(x, y)}. \tag{4.13}$$

In other words, if a spanning tree T has a stretch factor α, then for each pair of nodes x and y, the cost of the path from x to y in T is at most α times the cost of the shortest path between x and y in G. A design goal could thus be to determine spanning trees with small stretch factors (see Exercises 4.6.30 and 4.6.31). These ratios are sometimes difficult to calculate. Alternative measures that are easier to compute are obtained by taking into account only pairs of neighbors (instead of pairs of arbitrary nodes). One such measure is the so-called dilation, that is, the length of the longest path in the spanning tree T corresponding to an edge of G, defined as

$$\mathrm{dilation}_G(T) = \max_{(x,y) \in E} d_T(x, y). \tag{4.14}$$

We can also define the edge-stretch factor (or dilation factor) of a spanning tree T of G as

$$\max_{(x,y) \in E} \frac{d_T(x, y)}{\theta(x, y)}. \tag{4.15}$$

As an example, consider the spanning tree PT(c) used in the center-based solution; if all the link costs are the same, we have that for every two nodes x and y

1 ≤ d_G(x, y) ≤ d_{PT(c)}(x, y) ≤ d_{PT(c)} = d_G.

This means that in PT(c) the (unweighted) stretch factor σ_G(T), the dilation dilation_G(T), and the edge-stretch factor are all bounded by the same quantity, the diameter d_G of G.

For a given spanning tree T, the stretch factor and the dilation factor measure the worst ratio between the distance in T and in G for the same pair of nodes and the same edge, respectively. Another important cost measure is the average stretch factor describing the average ratio:

$$\bar{\sigma}_G(T) = \operatorname*{Average}_{x,y \in V} \frac{d_T(x, y)}{d_G(x, y)} \tag{4.16}$$

and the average edge-stretch factor (or average dilation factor) of a spanning tree T of G, defined as

$$\operatorname*{Average}_{(x,y) \in E} \frac{d_T(x, y)}{\theta(x, y)}. \tag{4.17}$$
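The measures (4.13) to (4.17) can be evaluated by brute force on a small example. The following is only a centralized sketch under stated assumptions: the graph and the tree are given as dictionaries of link costs, averages are taken over unordered pairs of distinct nodes and over undirected edges, and all names are hypothetical.

```python
import heapq
from itertools import combinations

def distances(adj, src):
    """Dijkstra: cost of the cheapest path from src to every node of adj."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in adj[u].items():
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist

def tree_measures(G, T):
    """Stretch, dilation, edge-stretch and their averages for spanning tree T of G."""
    dG = {v: distances(G, v) for v in G}
    dT = {v: distances(T, v) for v in T}
    pairs = list(combinations(G, 2))
    edges = [(x, y) for x in G for y in G[x] if x < y]
    pair_ratios = [dT[x][y] / dG[x][y] for x, y in pairs]
    edge_ratios = [dT[x][y] / G[x][y] for x, y in edges]
    return {
        "stretch": max(pair_ratios),                               # (4.13)
        "dilation": max(dT[x][y] for x, y in edges),               # (4.14)
        "edge_stretch": max(edge_ratios),                          # (4.15)
        "avg_stretch": sum(pair_ratios) / len(pair_ratios),        # (4.16)
        "avg_edge_stretch": sum(edge_ratios) / len(edge_ratios),   # (4.17)
    }

# Hypothetical example: a ring of four nodes with unit costs, and the
# spanning tree obtained by dropping the link between 3 and 0.
G = {0: {1: 1, 3: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1, 0: 1}}
T = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1, 3: 1}, 3: {2: 1}}
measures = tree_measures(G, T)
```

In this example the dropped link (3, 0) is the one that determines the worst-case measures, since its endpoints are at distance 3 in T.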

Construction of spanning trees with low average edge-stretch can be done effectively (Exercises 4.6.35 and 4.6.36). Summarizing, the main disadvantage of using a routing tree for all routing tasks is the fact that the routing path offered by such mechanisms is not optimal. If this is not a problem, these solutions are clearly a useful and viable alternative to shortest path routing. The choice of which spanning tree, among the many, should be used depends on the nature of the system and of the application. Natural choices include the ones described above, as well as those minimizing some of the cost measures we have introduced (see Exercises 4.6.31, 4.6.32, 4.6.33).


4.3 COPING WITH CHANGES

In some systems, the costs associated with the links might change over time; think, for example, of having a tariff (i.e., cost) for using a link during weekdays different from the one charged on the weekend. If such a change occurs, the shortest path between several pairs of nodes might change, rendering the information stored in the tables obsolete and possibly incorrect. Thus, the routing tables need to be adjusted.

In this section, we will consider the problem of dealing with such events. We will assume that when the cost of a link (x, y) changes, both x and y are aware of the change and of the new cost of the link. In other words, we will replace the Total Reliability restriction with Total Component Reliability (thus, the only changes are in the costs) in addition to the Cost Change Detection restriction.

Note that costs that change in time can also describe the occurrence of some link failures in the system: The crash failure of an edge can be described by having its cost become exceedingly large. Hence, in the following, we will talk of link crash failures and of cost changes as the same types of events.

4.3.1 Adaptive Routing

In these dynamic networks where costs change in time, the construction of the routing tables is only the first step for ensuring (shortest path) routing: There must be a mechanism to deal with the changes in the network status, adjusting the routing tables accordingly.

Map Update. A simple, albeit expensive, solution is the Map Update protocol. It requires first of all that each table contains the complete map of the entire network; the next “hop” for a message to reach its destination is computed on the basis of this map. The construction of the maps can be done, for example, using protocol Map Gossip discussed in Section 4.2.1. Clearly, any change will render the map inaccurate.
Thus, an integral part of this protocol is the update mechanism:

Maintenance
1. as soon as an entity x detects a local change (either in the cost or in the status of an incident link), x will update its map accordingly and inform all its neighbors of the change through an “update” message;
2. as soon as an entity y receives an “update” from a neighbor, it will update its map accordingly and inform all its neighbors of the change through an “update” message.

NOTE. In several existing systems, an even more expensive periodic maintenance mechanism is used: Step 1 of the maintenance mechanism is replaced by having each node, periodically and even if there are no detected changes, send its entire map to all its neighbors. This is, for example, the case with the second Internet routing protocol:


The complete map is being sent to all neighbors every 10–60 s (10 s if there is a cost change; 60 s otherwise).

The great advantage of this approach is that it is fully adaptive and can cope with any amount and type of changes. The clear disadvantage is the amount of information required locally and the volume of transmitted information.

Vector Update. To alleviate some of the disadvantages of the Map Update protocol, an alternative solution consists in using protocol Iterative Construction, which we designed to construct the routing tables, also to keep them up to date should faults or changes occur. Every entity will just keep its routing table. Note that a single change might make all the routing tables incorrect. To complicate things, changes are detected only locally, where they occur, and without a full map it might be impossible to detect whether a change has any impact on a remote site; furthermore, if several changes occur concurrently, their cumulative effect is unpredictable: A change might “undo” the damage inflicted on the routing tables by another change.

Whenever an entity x detects a local change (either in the cost or in the status of an incident link), the update mechanism is invoked, which will trigger an execution of possibly several iterations of protocol Iterative Construction. In regard to the update mechanism, we have two possible choices:

1. recompute the routing tables: everybody starts a new execution of the algorithm, throwing away the current tables, or
2. update current information: everybody starts a new iteration of the algorithm with x using the new data, continuing until the tables converge.

The first choice is very costly because, as we know, the construction of the routing tables is an expensive process. For this reason, one might want to recompute only what is necessary and only when it is necessary; hence the second choice is preferred. The second choice was used as the original Internet routing protocol; unfortunately, it has some problems.
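To get a feeling for the behavior of the second choice, the following sketch simulates it in a simplified, synchronous form: in every iteration each node recomputes its distance to a single destination from the estimates its neighbors held in the previous iteration. This is an assumption made for illustration, not protocol Iterative Construction itself. The network is a hypothetical path of four nodes x, y, z, w in which the cost of link (z, w) changes from 1 to K, and the function names are invented here.

```python
def iterate_until_stable(links, dist, dest):
    """Repeat synchronous vector-update iterations until no estimate changes.

    links: dict mapping an undirected link (u, v) to its current cost.
    dist:  dict node -> distance estimate to dest held before the change.
    Returns (final estimates, number of iterations that changed some estimate).
    """
    neighbors = {v: [] for v in dist}
    for (u, v), cost in links.items():
        neighbors[u].append((v, cost))
        neighbors[v].append((u, cost))
    iterations = 0
    while True:
        new = {
            v: 0 if v == dest else min(cost + dist[u] for u, cost in neighbors[v])
            for v in dist
        }
        if new == dist:
            return dist, iterations
        dist, iterations = new, iterations + 1

def after_cost_change(K):
    """Path x - y - z - w with unit costs; the cost of (z, w) becomes K."""
    links = {("x", "y"): 1, ("y", "z"): 1, ("z", "w"): K}
    stale = {"x": 3, "y": 2, "z": 1, "w": 0}  # correct before the change
    return iterate_until_stable(links, stale, "w")

final, iterations = after_cost_change(10)
```

The estimates eventually become correct, but the number of iterations needed grows with the new cost K, which can be checked by calling after_cost_change with increasing values of K.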
A well-known problem is the so-called count-to-infinity problem. Consider the simple network shown in Figure 4.13. Initially all links have cost 1. Then the cost of link (z, w) becomes a large integer K >> 1. Both nodes z and w will then start an iteration that will be performed by all entities. During this iteration, z is told by y that there is a path from y to w of cost 2; hence, at the end of the iteration, z sets its distance to w to 3. In the next iteration, y sets its distance from w to 4 because the best path to w (according to the vectors it receives from x and z) is through x. In general, after the (2i + 1)th iteration, x and z will set their cost for reaching w to 2(i + 1) + 1, while y will set it to 2(i + 1). This process will continue until z sets its cost for w to the actual value K. As K can be arbitrarily large, the number of iterations can be arbitrarily large. Solving this problem is not easy. See Exercises 4.6.38 and 4.6.39.

FIGURE 4.13: The count-to-infinity problem. (A path x, y, z, w: links (x, y) and (y, z) have cost 1; the cost of link (z, w) changes from 1 to K.)

Oscillation. We have seen some approaches to maintain routing information in spite of failures and changes in the system. A problem common to all the approaches is called oscillation. It occurs if the cost of a link is proportional to the amount of traffic on the link. Consider, for example, two disjoint paths π1 and π2 between x and y, where initially π1 is the “best” path. Thus, the traffic is initially sent on π1; this will have the effect of increasing its cost until π2 becomes the best path. At this point the traffic will be diverted to π2, increasing its cost, and so forth. This oscillation between the two paths will continue forever, requiring continuous execution of the update mechanism.

4.3.2 Fault-Tolerant Tables

To continue to deliver a message through a shortest path to its destination in the presence of cost changes or link crash failures, an entity must have up-to-date information on the status of the system (e.g., which links are up, their current cost, etc.). As we have seen, keeping the routing tables correct when the topology of the network or the edge values may change is a very costly operation. This is true even if faults are very limited. Consider, for example, a system where at any time there is at most one link down (not necessarily the same one at all times), and no other changes will ever occur in the system; this situation is called single link crash failure (SLF). Even in this restricted case, the amount of information that must be kept in addition to the shortest paths is formidable (practically the entire map). This is because the crash failure of a single edge can dramatically change all the shortest path information.
As the tables must be able to cope with every possible choice of the failed link, even in such a limited case, the memory requirements soon become unfeasible. Furthermore, when a link fails, every node must be notified so that it can route messages along the new shortest paths; the subsequent recovery of that link will also require such a notification. Such a notification process needs to be repeated at each crash failure and recovery, for the entire lifetime of the system. Hence, the amount of communication is rather high and never ending as long as there are changes.

Summarizing, the service of delivering a message through a shortest path in the presence of cost changes or link crash failures, called shortest path rerouting (SR), is expensive (sometimes to the point of being unfeasible) both in terms of storage and communication. The natural question is whether there exists a less expensive alternative. Fortunately, the answer is positive. In fact, if we relax the shortest path rerouting requirement and settle for lower quality services, then the situation changes drastically; for example, as we will see, if the requirement is just message delivery (i.e., not necessarily through a shortest path), this service can be achieved in our SLF system with very simple routing tables and without any maintenance mechanism.


In the rest of this section, we will concentrate on the single-link crash failure case.

Point-of-Failure Rerouting. To reduce the amount of communication and of storage, a simple and convenient alternative is to offer, after the crash failure of an arbitrary single link, a lower quality service called point-of-failure rerouting (PR):

Point-of-failure (Shortest path) Rerouting:
1. if the shortest path is not affected by the failed link, then the message will be delivered through that path;
2. otherwise, when the message reaches the node where the crash failure has occurred (the “point of failure”), the message will then be rerouted through a (shortest) path to its destination if no other failure occurs.

This type of service clearly has the advantage that there is no need to notify the entities of a link crash failure and its subsequent reactivation (if any): The message is forwarded as if there were no crash failures and if, by chance, the next link it must take has failed, it will be provided just then with an alternative route. This means that once constructed with the appropriate information for rerouting, the routing tables do not need to be maintained or updated. For this reason, the routing tables supporting such a service are called fault-tolerant tables.

The amount of information that a fault-tolerant table must contain (in addition to the shortest paths) to provide such a service will depend on what type of information is being kept at the nodes to do the rerouting and on whether or not the rerouting is guaranteed to be through a shortest path.

A solution consists in every node x knowing two (or more) edge-disjoint paths for each destination: the shortest path, and a secondary one to be used only if the link to the next “hop” in the shortest path has failed. So the routing mechanism is simple: When a message for destination r arrives at x, x determines the neighbor y in the shortest path to r.
If (x,y) is up, x will send the message to y, otherwise, it will determine the neighbor z in the secondary path to r and forward the message to z. The storage requirements of this solution are minimal: For each destination, a node needs to store in its routing table only one link in addition to the one in the fault-free shortest path. As we already know how to determine the shortest path trees, the problem is reduced to the one of computing the secondary paths (see Exercise 4.6.37). NOTE. The secondary paths of a node do not necessarily form a tree. A major drawback of this solution is that rerouting is not through a shortest path: If the crash failure occurs, the system does not provide any service other than message delivery. Although acceptable in some contexts, this level of service might not be


tolerable in general. Surprisingly, it is actually possible to offer shortest path rerouting storing at each node only one link for each destination in addition to the one in the fault-free shortest path. We are now going to see how to design such a service.

Point-of-Failure Shortest Path Rerouting. Consider a message originated by x and whose destination is s; its routing in the system will be according to the information contained in the shortest path spanning tree PT(s). The tree PT(s) is rooted in s; so every node x ≠ s has a parent p_s(x), and every edge in PT(s) links a node to its parent. When the link e_s[x] = (p_s(x), x) fails, it disconnects the tree into two subtrees, one containing s and the other x; call them T[s − x] and T[x − s]; see Figure 4.14. When e_s[x] fails, a new path from x to s must be found. It cannot be just any path: It must be the shortest path possible between x and s in the network without e_s[x].

Consider a link e = (u, v) ∈ G \ PT(s), not part of the tree, that can reconnect the two subtrees created by the crash failure of e_s[x], that is, u ∈ T[s − x] and v ∈ T[x − s]. We will call such a link a swap edge for e_s[x]. Using e we can create a new path from x to s. The path will consist of three parts: the path from x to v in T[x − s], the edge (u, v), and the path from u to s; see Figure 4.15. The cost of going from x to s using this path will then be

d_{PT(s)}(s, u) + θ(u, v) + d_{PT(s)}(v, x) = d(s, u) + θ(u, v) + d(v, x).

This is the cost of using e as a swap for e_s[x]. For each e_s[x] there are several edges that can be used as swaps, each with a different cost. If we want to offer shortest path rerouting from x to s when e_s[x] fails, we must use the optimal swap, that is, the swap edge for e_s[x] of minimum cost.

FIGURE 4.14: The crash failure of e_s[x] = (p_s(x), x) disconnects the tree PT(s) into the subtrees T[s − x] and T[x − s].

FIGURE 4.15: Point-of-failure rerouting using the swap edge e = (u, v) of e_s[x].

So the first task that must be solved is how to find the optimal swap for each edge e_s[x] in PT(s). This computation can be done efficiently (Exercises 4.6.40 and 4.6.41); its result is that every node x knows the optimal swap edge for its incident link e_s[x]. To be used to construct the routing tables, this process must be repeated n times, once for each destination s (i.e., for each shortest path spanning tree PT(s)).

Once the information about the optimal swap edges has been determined, it needs to be integrated in the routing tables so as to provide point-of-failure shortest path rerouting. The routing table of a node x must contain information about (1) the shortest paths as well as about (2) the alternative paths using the optimal swaps:

1. Shortest path information. First and foremost, the routing table of x contains for each destination s the link to the neighbor in the shortest path to s if there are no failures. Denote by p_s(x) this neighbor. The choice of symbol is not accidental: This neighbor is the parent of x in PT(s) and the link is really e_s[x] = (p_s(x), x).

2. Alternative path information. In the entry for the destination s, the routing table of x must also contain the information needed to reroute the message if e_s[x] = (p_s(x), x) is down. Let us see what this information is. Let e = (u, v) be the optimal swap edge that x has computed for e_s[x]; this means that the shortest path from x to s if e_s[x] fails is by first going from x to v, then over the link (u, v), and finally from u to s. In other words, if e_s[x] fails, x must reroute the message for s to v, that is, x must send it to its neighbor in the shortest path to v. The shortest paths to v are described by the tree PT(v); in fact, this neighbor is just p_v(x) and the link over which the message to s must be sent when rerouting is precisely e_v[x] = (p_v(x), x) (see Exercise 4.6.42).

Concluding, the additional information x must keep in the entry for destination s consists of the rerouting link e_v[x] = (p_v(x), x) and the closest node v on the optimal swap edge for e_s[x]; this information will be used only if e_s[x] is down.

TABLE 4.7: Entry in the Routing Table of x; e = (u, v) is the Optimal Swap Edge for e_s[x]

| Final Destination | Normal Link | Rerouting Link | Swap Destination | Swap Link |
| s                 | (p_s(x), x) | (p_v(x), x)    | v                | (u, v)    |
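As a hedged illustration of how an entry such as the one in Table 4.7 could drive forwarding, the sketch below simulates a message that carries, besides its final destination, a swap destination, a swap link, and a mode bit saying which of the two destinations is currently being pursued. The tables are written by hand for a hypothetical five-node network (links s-p, p-x, x-v, v-u, u-s) in which (u, v) is assumed to be the optimal swap edge for the link (p, x); the dictionary layout and the function name are illustrative assumptions only.

```python
# NORMAL[x][d]: next hop of x toward destination d when no link is down.
NORMAL = {
    "x": {"s": "p", "v": "v"},
    "p": {"s": "s"},
    "v": {"s": "x"},
    "u": {"s": "s"},
}

# SWAP[x][s]: (next hop on the rerouting link, swap destination v, swap link (u, v)).
SWAP = {
    "x": {"s": ("v", "v", ("u", "v"))},
}

def route(source, dest, down=frozenset()):
    """Return the list of nodes visited by a message from source to dest.

    down is a set of failed links, each given as a pair of endpoints.
    """
    def is_up(a, b):
        return (a, b) not in down and (b, a) not in down

    node, bit, swap_dest, swap_link = source, 0, None, None
    path = [node]
    while node != dest:
        if bit == 0:
            nxt = NORMAL[node][dest]
            if not is_up(node, nxt):
                # Point of failure: copy the swap information into the
                # message, set the bit, and use the rerouting link.
                nxt, swap_dest, swap_link = SWAP[node][dest]
                bit = 1
        elif node == swap_dest:
            # Swap destination reached: clear the bit, cross the swap link.
            bit = 0
            u, v = swap_link
            nxt = u if node == v else v
        else:
            # Still heading toward the swap destination.
            nxt = NORMAL[node][swap_dest]
        node = nxt
        path.append(node)
    return path

rerouted = route("x", "s", down={("p", "x")})
```

With the link (p, x) down, the message from x travels to v, crosses the swap link to u, and from there follows the normal link to s; with no failure it simply goes through p.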

Any message must thus contain, in addition to the final destination (node s in our example), also a field indicating the swap destination (node v in our example), the swap link (link (u, v) in our example), and a bit to explain which of the two must be considered (see Table 4.7). The routing mechanism is rather simple. Consider a message originating from r for node s.

PSR Routing Mechanism
1. Initially, r sets the final destination to s, the swap destination and the swap link to empty, and the bit to 0; it then sends the message toward the final destination using the normal link indicated in its routing table.
2. If a node x receives the message with final destination s and bit set to 0, then
   (a) if x = s, the message has reached its destination: s processes the message;
   (b) if e_s[x] = (p_s(x), x) is up, x forwards the unchanged message on that link;
   (c) if e_s[x] = (p_s(x), x) is down, then x
       i. copies to the swap destination and swap link fields of the message the swap destination and swap link entries for s in its routing table;
       ii. sets the bit to 1;
       iii. sends the message on the rerouting link indicated in its table.
3. If a node x receives the message with final destination s and bit set to 1, and swap destination set to v, then
   (a) if x = v, then
       i. it sets the bit to 0;
       ii. it sends the message on the swap link;
   (b) otherwise, it forwards the unchanged message on the link e_v[x] = (p_v(x), x).

| Destination | Mode | SwapDest | SwapLink | Content |
| s           | 1    | v        | (u, v)   | INFO    |

FIGURE 4.16: Message rerouted by x using the swap edge e = (u, v) of e_s[x].

4.3.3 On Correctness and Guarantees

Adaptive Routing. In all adaptive routing approaches, maintenance of the tables is carried out by broadcasting information about the status of the network; this can be done periodically or just when changes do occur. In all cases, news of changes detected by a node will eventually reach any node (still connected to it). However, because of time delays, while an update is being disseminated, nodes still unaware will be routing messages on the basis of incorrect information. In other words, as long as there are changes occurring in the system (and for some time afterwards), the information in the tables is unreliable and might be incorrect. In particular, it is likely that routing will not be done through a shortest path; it is actually possible that messages might not be delivered as long as there are changes. This sad state of affairs is not due to the individual solutions but solely due to the fact that time delays are unpredictable. As a result, it is impossible to make any guarantee on correctness and in particular on shortest path delivery for adaptive routing mechanisms.

This situation occurs even if the changes at any time are few and their nature limited, as in the SLF case. It would appear that we should be able to operate correctly in such a system; unfortunately this is not true: It is impossible to provide shortest path routing even in the single-link crash failure case. This is because the crash failure of a single edge can dramatically change all the shortest path information; thus, when the link fails, every node must be notified so that it can route messages along the new shortest paths; the subsequent recovery of that link will also require such a notification. Such a notification process needs to be repeated at each crash failure and recovery, and again the unpredictable time delays will make it impossible to guarantee correctness of the information available at the entities, and thus of the routing decisions they make on the basis of that information.

Question. What, if anything, can be guaranteed?
The only thing that we can say is that, if the changes stop (or there are no changes for a long period of time), then the updates to the routing information converge to the correct state, and routing will proceed according to the existing shortest paths. In other words, if the “noise” caused by changes stops, eventually the entities get the correct result.

Fault-Tolerant Tables. In the fault-tolerant tables approach, no maintenance of the routing tables is needed once they have been constructed. Therefore, there are no broadcasts or notifications of changes that, because of delays, might affect the correctness of the routing. However, fault-tolerant tables also suffer because of the unpredictability of time delays. For example, even with the single-link crash failure, point-of-failure shortest-path rerouting cannot be guaranteed to be correct: While the message for s is being rerouted from x toward the swap edge for e_s[x], the link e_s[x] might recover (i.e., come up again) and another link on the path may go down. Thus, the message will again be rerouted and might continue to be rerouted if a “bad” sequence of recoveries and failures occurs.


In other words, not only will the message not reach s through a shortest path from the first point of failure, but it will not reach s at all as long as there is a change. It might be argued that such a sequence of events is highly unlikely, but it is possible. Thus, again,

Question. What, if anything, can be guaranteed?

As in the case of adaptive routing, the only guarantee is that if the changes stop (or there are no changes for a long period of time), then messages will be (during that time) correctly delivered through point-of-failure shortest paths.

4.4 ROUTING IN STATIC SYSTEMS: COMPACT TABLES

There are systems that are static in nature; for example, if Total Reliability holds, no changes will occur in the network topology. We will also consider static any system where the routing tables, once constructed, cannot be modified (e.g., because they are hardcoded/hardwired). Such is, for example, any system etched on a chip; should faults occur, the entire chip will be replaced. In these systems, an additional concern in the design of shortest path routing tables is their size, that is, an additional design goal is to construct tables that are as small as possible.

4.4.1 The Size of Routing Tables

The full routing table can be quite large. In fact, for each of its n − 1 destinations, it contains the specification (and the cost) of the shortest path to that destination. This means that each entry possibly contains O(n log w) bits, where w ≥ n is the range of the entities’ names, for a total table size of O(n^2 log w) bits. Assuming the best possible case, that is, w = n, the number of bits required to store all the n full routing tables is S_FULL = O(n^3 log n). For large n, this is a formidable amount of space just to store the routing tables. Observe that for any destination, the first entry in the shortest path will always be a link to a neighbor.
Thus, it is possible to simplify the routing table by specifying for each destination y only the neighbor of x on the shortest path to it. Such a table is called short. For example, the short routing table for s in the network of Figure 4.1 is shown in Table 4.8. In its short representation, each entry of the table of an entity x will contain log w bits to represent the destination’s name and another log w bits to represent the neighbor’s name. In other words, the table contains 2(n − 1) log w bits. Assuming the best possible case, that is, w = n , the number of bits required to store all the routing tables is 2n(n − 1) log n.

TABLE 4.8: Short Representation of RT(s)

| Destination | Neighbor |
| h           | h        |
| k           | h        |
| c           | c        |
| d           | c        |
| e           | e        |
| f           | e        |

This amount of space can be further reduced if, instead of the neighbors’ names, we use the local port numbers leading to them. In this case, the size of the table of x will be (n − 1)(log w + log p_x) bits, where p_x ≥ deg(x) is the range of the local port numbers of x. Assuming the best possible case, that is, w = n and p_x = deg(x) for all x, this implies that the number of bits required to store all the routing tables is at least

$$S_{SHORT} = \sum_{x} (n - 1) \log \deg(x) = (n - 1) \log \prod_{x} \deg(x),$$

which can be still rather large. Notice that the same information can be represented by listing for each port the destinations reached via shortest path through that port; for example, see Table 4.9. This alternative representation of RT(x) uses only deg(x) + (n − 1) log n bits for a total of

$$S_{ALT} = \sum_{x} \bigl(\deg(x) + (n - 1) \log n\bigr) = 2m + n(n - 1) \log n. \tag{4.18}$$

It appears that there is not much more that can be done to reduce the size of the table. This is, however, not the case if we, as designers of the system, have the power to choose the names of the nodes and of the links.

TABLE 4.9: Alternative Short Representation of RT(s)

| Port      | Destinations |
| port_s(h) | h, k         |
| port_s(c) | c, d         |
| port_s(e) | e, f         |

4.4.2 Interval Routing

The question we are going to ask is whether it is possible to drastically reduce this amount of storage if we know the network topology and we have the power of choosing the names of the nodes and the port labels.

An Example: Ring Networks. Consider for example a ring network, and assume for the moment that all links have the same cost.

FIGURE 4.17: (a) Assigning names and labels; (b) routing table of node 2: port right leads to destinations 3, 4, 5, 6 and port left to destinations 7, 8, 0, 1.

Suppose that we assign as names to the nodes consecutive integers, starting from 0 and continuing clockwise, and we label the ports right or left depending on whether or not they are in the clockwise direction. See Figure 4.17(a). Concentrate on node 0. This node, like all the others, has only two links. Thus, whenever 0 has to route a message for z > 0, it must just decide whether to send it to right or to left. Observe that the choice will be right for 1 ≤ z ≤ ⌊n/2⌋ and left for ⌊n/2⌋ + 1 ≤ z ≤ n − 1. In other words, the destinations associated with each port are consecutive integers (modulo n). This is true not just for node 0: If x has to route a message for z ≠ x, the choice will be right if z is in the interval x + 1, x + 2, ..., x + ⌊n/2⌋ and left if z is in the interval x + ⌊n/2⌋ + 1, ..., x − 1, where the operations are modulo n. See Figure 4.17(b). In other words, in all these routing tables, the set of destinations associated with a port is an interval of consecutive integers, and, in each table, the intervals are disjoint. This is very important for our purpose of reducing the space. In fact, an interval has a very short representation: It is sufficient to store the two end values, that is, just 2 log n bits. We can actually do it with just log n bits; see Exercise 4.6.43. As a table consists of just two intervals, we have routing tables of 4 log n bits each, for a grand total of just 4n log n. This amount should be contrasted with the one of Expression 4.18 that, in the case of rings, becomes n^2 log n + l.o.t. In other words, we are able to go from quadratic


to just linear space requirements. Note that this is true even if the costs of the links are not all the same; see Exercise 4.6.44. The phenomenon we have just described is not isolated, as we will discuss next.

Routing with Intervals. Consider the names of the nodes in a network G. Without any loss of generality, we can always assume that the names are consecutive nonnegative integers, starting from 0, that is, the set of names is Z_n = {0, 1, ..., n − 1}. Given two integers j, k ∈ Z_n, we denote by (j, k) the sequence

(j, k) = j, j + 1, j + 2, ..., k    if j < k
(j, k) = j, j + 1, j + 2, ..., n − 1, 0, 1, ..., k    if j ≥ k.

Such a sequence (j, k) is called a circular interval of Z_n; the empty interval ∅ is also an interval of Z_n.

Suppose that we are able to assign names to the nodes so that the shortest path routing tables for G have the following two properties. At every node x,

1. interval: for each link incident to x, the (names of the) destinations associated with that link form a circular interval of Z_n;
2. disjointness: each destination is associated with only one link incident to x.

If this is the case, then we can have for G a very compact representation of the routing tables, like in the example of the ring network. In fact, for each link the set of destinations is an interval of consecutive integers, and, like in the ring, the intervals associated with the links of a given node are all disjoint. In other words, each table consists of a set of intervals (some of them may be empty), one for each incident link. From the storage point of view, this is very good news because we can represent such intervals by just their start values (or, alternatively, by their end values). In other words, the routing table of x will consist of just one entry for each of its links. This means that the amount of storage for its table is only deg(x) log n bits. In turn, this means that the number of bits used in total to represent all the routing tables will be just

$$S_{INTERVAL} = \sum_{x} \deg(x) \log n = 2m \log n. \tag{4.19}$$

How will the routing mechanism then work with such tables? Suppose x has a message whose destination is y. Then x checks in its table which interval y is part of (as the intervals are disjoint, y will belong to exactly one) and sends the message to the corresponding link. Because of its nature, this approach is called interval routing. If it can be done, as we have just seen, it allows for efﬁcient shortest-path routing with a minimal amount of storage requirements.
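The lookup just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_table` and `next_link`, the link labels, and the choice of storing only the start value of each nonempty interval are hypothetical, not part of the text. The sketch also assumes that the intervals at a node are disjoint and that the node tests separately whether it is itself the destination.

```python
from bisect import bisect_right

def build_table(start_of_link):
    """Build the table of one node (hypothetical helper).

    start_of_link maps each incident link to the start value of its
    nonempty circular interval; links with empty intervals are omitted.
    """
    return sorted((start, link) for link, start in start_of_link.items())

def next_link(table, y):
    """Return the link whose circular interval contains destination y."""
    starts = [start for start, _ in table]
    # Index of the last start value <= y. If there is none, the index is -1,
    # which selects the interval with the largest start: the one that wraps
    # around past n - 1 back to 0.
    i = bisect_right(starts, y) - 1
    return table[i][1]

# Node 2 of a ring with n = 8 (assumed example): destinations 3..6 are
# reached via "right", destinations 7, 0, 1 via "left".
table = build_table({"right": 3, "left": 7})
print(next_link(table, 5), next_link(table, 0))
```

Storing the start values in sorted order lets a node find the right link with a binary search over deg(x) entries, which matches the one-entry-per-link storage counted in (4.19).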


FIGURE 4.18: Naming for interval routing in trees

It requires, however, that we, as designers, find an appropriate way to assign names to nodes so that the interval and disjointness properties hold. Given a network G, it is not obvious how to do it or whether it can be done at all.

Tree Networks

First of all we will consider tree networks. As we will see, in a tree it is always possible to achieve our goal, and it can actually be done in several different ways. Given a tree T, we first choose a node s as the source, transforming T into the tree T(s) rooted in s; in this tree, each node x other than s has a parent, and each node has some children (possibly none). We then assign as names to the nodes consecutive integers, starting from 0, according to the post-order traversal of T(s), for example, using

procedure Post Order Naming(x, k)
begin
   Unnamed Children(x) := Children(x);
   while Unnamed Children(x) ≠ ∅ do
      y ← Unnamed Children(x);
      Post Order Naming(y, k)
   endwhile
   myname := k;
   k := k + 1;
end

started by calling Post Order Naming(s, 0). Here y ← Unnamed Children(x) extracts one child from the set, and k acts as a counter shared by all the calls.

This assignment of names has several properties. For example, any node has a larger name than all its descendants. More importantly, it has the interval and disjointness properties (Exercise 4.6.48). Informally, the interval property follows because, when executing Post Order Naming with input (x, k), x and its descendants will be given as names consecutive integers starting from k. See for example Figure 4.19.
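As a concrete illustration of the post-order naming and of why it yields one interval per link, here is a minimal Python sketch. The representation of the rooted tree as a dictionary of child lists, the function names, and the small example tree are all assumptions made for this sketch. An interval is written as a pair (j, k), with (j, j) taken to denote the single name j, and each table is keyed by the name of the neighbor at the other end of the link.

```python
def post_order_naming(children, s):
    """Name the nodes of the tree rooted at s with 0..n-1 in post-order."""
    name = {}
    k = 0  # counter shared by all the recursive calls

    def visit(x):
        nonlocal k
        for y in children.get(x, []):
            visit(y)
        name[x] = k  # x is named after all of its descendants
        k += 1

    visit(s)
    return name

def interval_tables(children, s):
    """Return (name, tables); tables[name of x] maps the name of each
    neighbor of x to the circular interval (j, k) routed over that link."""
    name = post_order_naming(children, s)
    n = len(name)
    first = {}   # smallest name in the subtree rooted at each node
    tables = {}

    def fill(x, parent):
        kids = children.get(x, [])
        for y in kids:
            fill(y, x)
        first[x] = first[kids[0]] if kids else name[x]
        # The subtree of child y holds exactly the names first[y]..name[y].
        table = {name[y]: (first[y], name[y]) for y in kids}
        if parent is not None:
            # Every node outside the subtree of x is reached via the parent.
            table[name[parent]] = ((name[x] + 1) % n, (first[x] - 1) % n)
        tables[name[x]] = table

    fill(s, None)
    return name, tables

# Assumed example: root a with children b and c; b has children d and e.
tree = {"a": ["b", "c"], "b": ["d", "e"]}
names, tables = interval_tables(tree, "a")
print(names)
print(tables[names["b"]])
```

In this example node b receives name 2, and its table has one interval per link: (0, 0) and (1, 1) toward its two children and the interval (3, 4) toward its parent. The recursion is adequate for small trees; a very deep tree would need an iterative traversal.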


FIGURE 4.19: Disjoint intervals (labels recoverable from the figure: node 8, with intervals ⟨9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3⟩ and ⟨4, 5, 6⟩)

Special Networks

Most regular network topologies we have considered in the past can be assigned names so that interval routing is possible. This is, for example, the case of the p × q mesh and torus, hypercube, butterfly, and cube-connected cycles; see Exercises 4.6.51 and 4.6.52. For these networks the construction is rather simple. Using a more complex construction, names can be assigned so that interval routing can be done also in any outerplanar graph (Exercise 4.6.53); recall that a graph is outerplanar if it can be drawn in the plane with all the nodes lying on a ring and all edges lying in the interior of the ring without crossings.

Question. Can interval routing be done in every network?

The answer is unfortunately No. In fact there exist rather simple networks, the so-called globe graphs (one is shown in Figure 4.20), for which interval routing is impossible (Exercise 4.6.55).

Multi-Intervals

As we have seen, interval routing is a powerful technique, but the classes of networks in which it is possible are rather limited. To overcome this limitation without excessively increasing the size of the routing tables, one approach is to associate with each link a small number of intervals. An interval-routing scheme that uses up to k intervals per edge is called a k-interval routing scheme.
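To make the notion of "number of intervals per link" concrete, the following Python sketch splits the set of destinations routed over one link into maximal circular intervals of Z_n; for a given naming, the number of intervals returned is what that link needs, and the largest such number over all links is the k of the resulting scheme. The function name `circular_intervals` and the example destination set are assumptions made for illustration.

```python
def circular_intervals(dests, n):
    """Split a set of names from Z_n into maximal circular intervals.

    Each interval is returned as a pair (j, k); if j > k the interval
    wraps around from n - 1 to 0.
    """
    s = set(dests)
    if len(s) == n:
        return [(0, n - 1)]  # everything: a single interval
    # An interval starts at v exactly when v's circular predecessor is absent.
    starts = sorted(v for v in s if (v - 1) % n not in s)
    intervals = []
    for j in starts:
        k = j
        while (k + 1) % n in s:  # extend while the successor is present
            k = (k + 1) % n
        intervals.append((j, k))
    return intervals

# Assumed example with n = 8: the destinations {0, 1, 4, 5, 7} need two
# intervals, namely (4, 5) and the wrapping interval (7, 1).
print(circular_intervals({0, 1, 4, 5, 7}, 8))
```

With a single interval per link the scheme is ordinary interval routing; renaming the nodes can change the count, which is why the naming is the crux of the design.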

FIGURE 4.20: A globe graph: interval routing is not possible.


Clearly, with enough intervals we can find a scheme for every connected graph. The question is whether this can be achieved with a small k. The answer again is No. In fact, there are graphs where O(n) intervals are needed in each edge (Exercise 4.6.56).

Suboptimal Interval Routing

A reason why it is impossible to do interval routing in all graphs is that we require the tables to provide shortest paths. The situation changes if we relax this requirement. If we ask the tables to provide us just with a path to the destination, not necessarily the shortest one, then we can use the approach already discussed in Section 4.2.6: We construct a single spanning tree T of the network G and use only the edges of T for routing. Once we have the tree T, we assign the names to the nodes using the naming algorithm for trees that provides interval routing. In this way, we obtain for G the very compact routing tables provided by interval routing. Clearly, the interval routing mechanism so constructed is optimal (i.e., shortest path) for the tree T but not necessarily so for the original network G. This means that suboptimal interval routing is always possible in any network.

Question. How much worse can a path provided by this approach be than the shortest one to the destination?

If we choose as tree T a breadth-first spanning tree rooted in a center of the graph G, then its diameter is at most twice the diameter of the original graph (the worst case is when G is a ring). This means that the longest route is never more than 2 diam(G). We can extend this approach by allowing the longest route to be within a factor β ≤ 2 of the diameter of G and by using more than one interval. We have seen that it is possible to obtain β = 2 using a single interval per edge. The question then becomes whether, using more intervals, we can obtain a better scheme (i.e., a smaller β). The answer is again not very positive; for example, to have the longest route shorter than (3/2) diam(G), we need O(log n) labels (Exercise 4.6.58).

4.5 BIBLIOGRAPHICAL NOTES

The construction of routing tables is a prerequisite for the functioning of many networks. One of the earliest protocols is due to William Tajibnapis [31]. The basic MapGossip for the construction of all routing tables is due to Eric Rosen [29]. Protocol IteratedConstruction is the distributed version of Bellman's sequential algorithm designed by Lester Ford and D. Fulkerson [13]; from the start it has been the main routing algorithm in the Internet. The same cost as IteratedConstruction, $O(n^2 m)$, was incurred by several other protocols designed much later, including the ones of Philip Merlin and Adrian Segall [25] and of Jayadev Misra and Mani Chandy [22]. The improvement to $O(n^3)$ is due to Baruch Awerbuch, who designed a protocol to construct a single shortest path tree


using $O(n^2)$ messages [6]. The same bound is achieved by protocol PT Construction, the efficient distributed implementation of Dijkstra's sequential algorithm designed by K. Ramarao and S. Venkatesan [28]. The even more efficient Protocol SparserGossip is due to Yehuda Afek and Moty Ricklin [1]. A protocol for systems allowing long messages was designed by Sam Toueg with cost O(nm) [32]; the reduction to $O(n^2)$ is easy to achieve using protocol MapGossip by Eric Rosen [29] (Exercise 4.6.4), constructing, however, complete maps at each entity; the same cost but with less local storage (Exercise 4.6.18) has been obtained by S. Haldar [20].

The distributed construction of min-hop spanning trees has been extensively investigated. Protocol BF (known as the "Coordinated Minimum Hop Algorithm") is due to Bob Gallager [17]; a different protocol with the same cost was independently designed by To-Yat Cheung [8]. The idea of reducing time by partitioning the layers of the breadth-first tree into groups (Section 4.2.5), together with a series of time-message tradeoffs, is also due to Gallager [17]. Protocol BF Layers has been designed by Greg Frederickson [15]. The problem of reducing time while maintaining a reasonable message complexity has been investigated by Baruch Awerbuch [3], Baruch Awerbuch and Bob Gallager [5], and Y. Zhu and To-Yat Cheung [35]. The near-optimal bounds (Exercise 4.6.26) have been obtained by Baruch Awerbuch [4].

The suboptimal solutions of center-based and median-based routing were first discussed in detail by David Wall and Susanna Owicki [34]. The lower bound on average edge-stretch and the construction of spanning trees with low average edge-stretch (Exercises 4.6.34, 4.6.35, and 4.6.36) are due to Noga Alon, Richard Karp, David Peleg, and Doug West [2].

The idea of point-of-failure rerouting was suggested independently by Enrico Nardelli, Guido Proietti, and Peter Widmayer [27] and by Hiro Ito, Kazuo Iwama, Yasuo Okabe, and Takuya Yoshihiro [21]. The distributed algorithm for computing the swap edges (Exercise 4.6.41) was designed by Paola Flocchini, Linda Pagli, Tony Mesa, Giuseppe Prencipe, and Nicola Santoro [12].

The idea of compact routing was introduced by Nicola Santoro and Ramez Khatib [30], who designed the interval routing for trees; this idea was then extended by Jan van Leeuwen and Richard Tan [24]. The interval routing for outerplanar graphs (Exercise 4.6.53) is due to Greg Frederickson and Ravi Janardan [16]. The more restrictive notion of linear interval routing (Exercise 4.6.54 and Problem 4.6.1) was introduced and studied by Erwin Bakker, Jan van Leeuwen, and Richard Tan [7]; the more general notion of Boolean routing was introduced by Michele Flammini, Giorgio Gambosi, and Sandro Salomone [11]. Several issues of compact routing have been investigated, among others, by Greg Frederickson and Ravi Janardan [16], Pierre Fraigniaud and Cyril Gavoille [14], and Cyril Gavoille and David Peleg [19]. Exercises 4.6.56, 4.6.57, and 4.6.58 are due to Cyril Gavoille and Eric Guevremont [18], Evangelos Kranakis and Danny Krizanc [23], and Savio Tse and Francis Lau [33], respectively. Characterizations of networks supporting interval routing are due to Lata Narayanan and Sunil Shende [26], Tamar Eilam, Shlomo Moran, and Shmuel Zaks [9], and Michele Flammini, Giorgio Gambosi, Umberto Nanni, and Richard Tan [10].


4.6 EXERCISES, PROBLEMS, AND ANSWERS

4.6.1 Exercises

Exercise 4.6.1 Write the set of rules corresponding to Protocol Map Gossip described in Section 4.2.1.

Exercise 4.6.2 () Consider a tree network where each entity has a single item of information. Determine the time costs of gossiping. What would the time costs be if each entity x initially has deg(x) items?

Exercise 4.6.3 Consider a tree network where each entity has f(n) items of information. Assume that messages can contain g(n) items of information (instead of O(1)); with how many messages can gossiping be performed?

Exercise 4.6.4 Using your answer to Exercise 4.6.3, with how many messages can all routing tables be constructed if g(n) = O(n)?

Exercise 4.6.5 Consider a tree network where each entity has f(n) items of information. Assume that messages can contain g(n) items of information (instead of O(1)); with how many messages can all items of information be collected at a single entity?

Exercise 4.6.6 Using your answer to Exercise 4.6.5, with how many messages can all routing tables be constructed at that single entity if g(n) = O(n)?

Exercise 4.6.7 Write the set of rules corresponding to Protocol Iterated Construction described in Section 4.2.2. Implement and properly test your implementation.

Exercise 4.6.8 Prove that Protocol Iterated Construction converges to the correct routing tables and will do so after at most n − 1 iterations. Hint: Use induction to prove that $V_x^i[z]$ is the cost of the shortest path from x to z using at most i hops.

Exercise 4.6.9 We have assumed that the cost of a link is the same in both directions, that is, θ(x, y) = θ(y, x). However, there are cases when θ(x, y) can be different from θ(y, x). What modifications have to be made so that protocol Iterated Construction works correctly also in those cases?

Exercise 4.6.10 In protocol PT Construction, no action is provided for an idle entity receiving an Expand message. Prove that such a message will never be received in such a state.

Exercise 4.6.11 In procedure Compute Local Minimum of protocol PT Construction, an entity might set path length to infinity. Show that if this happens, this entity will set path length to infinity in all subsequent iterations.


Exercise 4.6.12 In protocol PT Construction, each entity will eventually set path length to infinity. Show that when this happens to a leaf of the constructed tree, that entity can be removed from further computations.

Exercise 4.6.13 Modify protocol PT Construction so that it constructs the routing table RT(s) of the source s.

Exercise 4.6.14 We have assumed that the cost of a link is the same in both directions, that is, θ(x, y) = θ(y, x). However, there are cases when θ(x, y) can be different from θ(y, x). What modifications have to be made so that protocol PT Construction works correctly also in those cases?

Exercise 4.6.15 Prove that any G has a (log n, n) sparser.

Exercise 4.6.16 Show how to construct a (log n, n) sparser with O(m + n log n) messages.

Exercise 4.6.17 Show how to use a (log n, n) sparser to solve the all-pairs shortest paths problem in $O(n^2 \log n)$ messages.

Exercise 4.6.18 Assume that messages can contain O(n) items of information (instead of O(1)). Show how to construct all the shortest path trees with just $O(n^2)$ messages.

Exercise 4.6.19 Prove that, after iteration i − 1 of protocol BF Construction, (a) all the nodes at distance up to i − 1 are part of the tree; (c) each node at distance i − 1 knows which of its neighbors are at distance i − 1.

Exercise 4.6.20 Write the set of rules corresponding to protocol BF described in Section 4.2.2. Implement and properly test your implementation.

Exercise 4.6.21 Write the set of rules corresponding to protocol BF Levels. Implement and properly test your implementation.

Exercise 4.6.22 Let Explore(j, k) be the first message x accepts in the expansion phase of protocol BF Levels. Prove that the number of times x will change its level in this phase is at most j − t + 1 < l.

Exercise 4.6.23 Prove that in the expansion phase of an iteration of protocol BF Levels, all nodes in levels t + 1 to t + l are reached and attached to the existing fragment, where t is the level of the sources (i.e., the leaves in the current fragment).

Exercise 4.6.24 Consider protocol BF Levels when l = d(G). Show how to obtain the same message and time complexity without any a priori knowledge of d(G).


Exercise 4.6.25 Prove that if we choose l = d(G) in protocol BF Levels, then in any synchronous execution the number of messages will be exactly 2m + n − 1.

Exercise 4.6.26 () Show how to construct a breadth-first spanning tree in time $O(d(G)^{1+\epsilon})$ using no more than $O(m^{1+\epsilon})$ messages, for any $\epsilon > 0$.

Exercise 4.6.27 Let c be a center of G and let SPT(c) be the shortest path tree of c. Prove that diam(G) ≤ 2 diam(SPT(c)).

Exercise 4.6.28 Let T be a spanning tree of G. Prove that

\[
\sum_{(x,y) \in T} |T[x-y]| \; |T[y-x]| \; w(x,y) \;=\; \sum_{u,v \in T} d_T(u,v).
\]

Exercise 4.6.29 (median-based routing) Let z be a median of G (i.e., a node for which the sum of distances to all other nodes is minimized) and let PT(z) be the shortest path tree of z. Prove that Traffic(PT(z)) ≤ 2 Traffic(T), where T is the spanning tree of G for which Traffic is minimized.

Exercise 4.6.30 Consider a ring network $R_n$ with weighted edges. Prove or disprove that PT(c) = MSP($R_n$), where c is a center of $R_n$ and MSP($R_n$) is the minimum-cost spanning tree of $R_n$.

Exercise 4.6.31 Consider a ring network $R_n$ with weighted edges. Let c and z be a center and a median of $R_n$, respectively.
1. For each of the following spanning trees of $R_n$, compare the stretch factor and the edge-stretch factor: PT(c), PT(z), and the minimum-cost spanning tree MSP($R_n$).
2. Determine bounds on the average edge-stretch factor of PT(c), PT(z), and MSP($R_n$).

Exercise 4.6.32 () Consider an a × a square mesh $M_{a,a}$ where all costs are the same.
1. Is it possible to construct two spanning trees T and T such that σ (T ) < σ (T ) but (T ) > (T ) ? Explain.
2. Is it possible to construct two spanning trees T and T such that σ (T ) < σ (T ) but (T ) > (T ) ? Explain.

Exercise 4.6.33 Consider a square mesh $M_{a,a}$ where all costs are the same. Construct two spanning trees T and T such that σ (T ) < σ (T ) but (T ) > (T ).

Exercise 4.6.34 () Show that there are graphs G with unweighted edges where G (T ) = Ω(log n) for every spanning tree T of G.

Exercise 4.6.35 () Design an efficient protocol for computing a spanning tree with low average edge-stretch of a network G with unweighted edges.


Exercise 4.6.36 () Design an efficient protocol for computing a spanning tree with low average edge-stretch of a network G with weighted edges.

Exercise 4.6.37 () Design a protocol for computing the secondary paths of a node x. You may assume that the shortest-path tree PT(x) has already been constructed and that each node knows its own and its neighbors' distances from x. Your protocol should use no more messages than are required to construct PT(x).

Exercise 4.6.38 (split horizon) () Consider the following technique, called split horizon, for solving the count-to-infinity problem discussed in Section 4.3.1: During an iteration, a node a does not send its cost for destination c to its neighbor b if b is the next node in the "best" path (so far) from a to c. In the example of Figure 4.13, in the first iteration y does not send its cost for w to z, and thus z will correctly set its cost for w to K. In the next two iterations y and x will correctly set their cost for w to K + 1 and K + 2, respectively. Prove or disprove that split horizon solves the count-to-infinity problem.

Exercise 4.6.39 (split horizon with poison reverse) () Consider the following technique, called split horizon with poison reverse, for solving the count-to-infinity problem discussed in Section 4.3.1: During an iteration, a node a sends its cost for destination c set to ∞ to its neighbor b if b is on the "best" path (so far) from a to c. Prove or disprove that split horizon with poison reverse solves the count-to-infinity problem.

Exercise 4.6.40 () Design an efficient protocol that, given a shortest-path spanning tree PT(s), determines an optimal swap for every edge in PT(s): At the end of the execution, every node x knows the optimal swap edge for its incident link $e_s[x]$. Your protocol should use no more than O(n h(s)) messages, where h(s) is the height of PT(x).

Exercise 4.6.41 () Show how to answer Exercise 4.6.40 using no more than O(n (s)) messages, where n (s) is the number of edges in the transitive closure of PT(x).

Exercise 4.6.42 Let e = (u, v) be the optimal swap edge that x has computed for $e_s[x]$. Prove that, if $e_s[x]$ fails, to achieve point-of-failure shortest path rerouting, x must send the message for s to the incident link ($p_v(x)$, x).

Exercise 4.6.43 Show how to represent the intervals of a ring with just log n bits per interval.

Exercise 4.6.44 Show that the intervals of a ring can be represented with just log n bits per interval, even if the costs of the links are not all the same.


Exercise 4.6.45 Let G be a network and assume that we can assign names to the nodes so that in each routing table, the destinations for each link form an interval. Determine what conditions the intervals must satisfy so that they can be represented with just log n bits each.

Exercise 4.6.46 Redefine properties interval and disjointness in case the n integers used as names are not consecutive, that is, they are chosen from a larger set $Z_w$, w > n.

Exercise 4.6.47 Show an assignment of names in a tree that does not have the interval property. Does there exist an assignment of distinct names in a tree that has the interval property but not the disjointness one? Explain your answer.

Exercise 4.6.48 Prove that in a tree, the assignment of names by Post-Order traversal has both interval and disjointness properties.

Exercise 4.6.49 Prove that in a tree, also the assignment of names by Pre-Order traversal has both interval and disjointness properties.

Exercise 4.6.50 Determine whether interval routing is possible in the regular graph shown in Figure 4.21. If so, show the routing table; otherwise explain why.

Exercise 4.6.51 Design an optimal interval routing scheme for the p × q mesh and torus. How many bits of storage will it require?

Exercise 4.6.52 Design an optimal interval routing scheme for the d-dimensional (a) hypercube, (b) butterfly, and (c) cube-connected cycles. How many bits of total storage will each require?

FIGURE 4.21: The regular graph used in Exercise 4.6.50.


Exercise 4.6.53 () Show how to assign names to the nodes of an outerplanar graph so that interval routing is possible.

Exercise 4.6.54 () If for every x all the intervals in its routing table are strictly increasing (i.e., there is no "wraparound" node 0), the interval routing is called linear. Prove that there are networks for which there exists interval routing but linear interval routing is impossible.

Exercise 4.6.55 Prove that in the globe graph of Figure 4.20, interval routing is not possible.

Exercise 4.6.56 () Consider the approach of k-interval routing. Prove that there are graphs that require k = O(n) intervals.

Exercise 4.6.57 () Consider allowing each route to be within a factor α from optimal. Prove that if we want α = 2, there are graphs that require $O(n^2)$ bits of storage at each node.

Exercise 4.6.58 () Consider allowing the longest route to be within a factor β from the diameter diam(G) of the network, using at most k labels per edge. Prove that if we want β < 3/2, then there are graphs that require O(log n) bits of storage at each node.

4.6.2 Problems

Problem 4.6.1 Linear Interval Routing. () If for every x all the intervals in its routing table are strictly increasing (i.e., there is no "wraparound" node 0), the interval routing is called linear. Characterize the class of graphs for which there exists a linear interval routing.

4.6.3 Answers to Exercises

Partial Answer to Exercise 4.6.26. Choose the size of the strip to be $k = \sqrt{d(G)}$. A strip cover is a collection of trees that span all the source nodes of a strip. In iteration i, first of all construct a "good" cover of strip i.

Answer to Exercise 4.6.29. Observe that for any spanning tree T of G, $\mathrm{Traffic}(T) = \sum_{u,v \in V} d_T(u,v)$ (Exercise 4.6.28). Let $\mathrm{SumDist}(x) = \sum_{u \in V} d_G(u,x)$; clearly $\mathrm{Traffic}(T) \ge \sum_{x \in V} \mathrm{SumDist}(x)$. Let z be a median of G (i.e., a node for which SumDist is minimized); then $\mathrm{SumDist}(z) \le \frac{1}{n}\,\mathrm{Traffic}(T)$. Thus we have that

\[
\begin{aligned}
\mathrm{Traffic}(PT(z)) &= \sum_{u,v \in V} d_{PT(z)}(u,v) \\
&\le \sum_{u,v \in V} \bigl( d_{PT(z)}(u,z) + d_{PT(z)}(z,v) \bigr) \\
&\le (n-1) \sum_{u \in V} d_{PT(z)}(u,z) + (n-1) \sum_{v \in V} d_{PT(z)}(v,z) \\
&= 2(n-1)\,\mathrm{SumDist}(z) \;\le\; 2\,\mathrm{Traffic}(T).
\end{aligned}
\]


FIGURE 4.22: Graph with interval routing but where no linear interval routing exists.

Answer to Exercise 4.6.43. In the table of node x, the interval associated to right always starts with x + 1 while the one associated to left always ends with x − 1. Hence, for each interval, it is sufficient to store only the other end value.

Partial Answer to Exercise 4.6.54. Consider the graph shown in Figure 4.22.

BIBLIOGRAPHY

[1] Y. Afek and M. Ricklin. Sparser: a paradigm for running distributed algorithms. Journal of Algorithms, 14(2):316–28, March 1993.
[2] N. Alon, R.M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the k-server problem. SIAM Journal on Computing, 24:78–100, 1995.
[3] B. Awerbuch. Reducing complexities of the distributed max-flow and breadth-first-search algorithms by means of network synchronization. Networks, 15:425–437, 1985.
[4] B. Awerbuch. Distributed shortest path algorithms. In Proc. 21st Ann. ACM Symp. on Theory of Computing, pages 490–500, 1989.
[5] B. Awerbuch and R.G. Gallager. A new distributed algorithm to find breadth first search trees. IEEE Transactions on Information Theory, 33:315–322, 1987.
[6] B. Awerbuch. Complexity of network synchronization. Journal of the ACM, 32(4):804–823, October 1985.
[7] E.M. Bakker, J. van Leeuwen, and R.B. Tan. Linear interval routing. Algorithms Review, 2(2):45–61, 1991.
[8] T.-Y. Cheung. Graph traversal techniques and the maximum flow problem in distributed computation. IEEE Transactions on Software Engineering, 9:504–512, 1983.
[9] T. Eilam, S. Moran, and S. Zaks. The complexity of the characterization of networks supporting shortest-path interval routing. In 4th International Colloquium on Structural Information and Communication Complexity, pages 99–11, Ascona, 1997.
[10] M. Flammini, G. Gambosi, U. Nanni, and R.B. Tan. Characterization results of all shortest paths interval routing schemes. Networks, 37(4):225–232, 2001.
[11] M. Flammini, G. Gambosi, and S. Salomone. Boolean routing. In 7th International Workshop on Distributed Algorithms, pages 219–233, Lausanne, 1993.
[12] P. Flocchini, L. Pagli, T. Mesa, G. Prencipe, and N. Santoro. Point-of-failures shortest path rerouting: computing the optimal swaps distributively. IEICE Transactions, 2006.
[13] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, 1962.


[14] P. Fraigniaud and C. Gavoille. Interval routing schemes. Algorithmica, 21(2):155–182, 1998.
[15] G.N. Frederickson. A distributed shortest path algorithm for a planar network. Information and Computation, 86(2):140–159, June 1990.
[16] G.N. Frederickson and R. Janardan. Designing networks with compact routing tables. Algorithmica, 3:171–190, June 1988.
[17] R.G. Gallager. Distributed minimum hop algorithms. Technical Report LIDS-P-1175, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1982.
[18] C. Gavoille and E. Guevremont. Worst case bounds for shortest path interval routing. Journal of Algorithms, 27:1–25, 1998.
[19] C. Gavoille and D. Peleg. The compactness of interval routing. SIAM Journal on Discrete Mathematics, 12(4):459–473, 1999.
[20] S. Haldar. An 'all pairs shortest paths' distributed algorithm using $2n^2$ messages. In Proceedings of the 19th International Workshop on Graph-Theoretic Concepts in Computer Science (WG'93), Utrecht, Netherlands, June 1993.
[21] H. Ito, K. Iwama, Y. Okabe, and T. Yoshihiro. Single backup table schemes for shortest-path routing. Theoretical Computer Science, 333:347–353, 2004.
[22] J. Misra and K.M. Chandy. Distributed computations on graphs: shortest path algorithms. Communications of the ACM, 25(11):833–837, November 1982.
[23] E. Kranakis and D. Krizanc. Lower bounds for compact routing. In 13th Symposium on Theoretical Aspects of Computer Science, pages 529–540, Grenoble, February 1996.
[24] J. van Leeuwen and R.B. Tan. Interval routing. The Computer Journal, 30:298–307, 1987.
[25] P.M. Merlin and A. Segall. A failsafe distributed routing protocol. IEEE Transactions on Communications, 27(9):1280–1287, September 1979.
[26] L. Narayanan and S. Shende. Characterization of networks supporting shortest-path interval labelling schemes. In 3rd International Colloquium on Structural Information and Communication Complexity, pages 73–87, 1996.
[27] E. Nardelli, G. Proietti, and P. Widmayer. Swapping a failing edge of a single source shortest paths tree is good and fast. Algorithmica, 35:56–74, 2003.
[28] K.V.S. Ramarao and S. Venkatesan. On finding and updating shortest paths distributively. Journal of Algorithms, 13(2):235–257, 1992.
[29] E.C. Rosen. The updating protocol of Arpanet's new routing algorithm. Computer Networks, 4:11–19, 1980.
[30] N. Santoro and R. Khatib. Labeling and implicit routing in networks. The Computer Journal, 28:5–8, 1985.
[31] W.D. Tajibnapis. A correctness proof of a topology information maintenance protocol for a distributed computer network. Communications of the ACM, 20(7):477–485, 1977.
[32] S. Toueg. An all-pairs shortest-path distributed algorithm, 1980.
[33] S.S.H. Tse and F.C.M. Lau. On the space requirement of interval routing. IEEE Transactions on Computers, 48(7):752–757, July 1999.
[34] D.W. Wall and S. Owicki. Construction of centered shortest-path trees in networks. Networks, 13(2):207–332, 1983.
[35] Y. Zhu and T.-Y. Cheung. A new distributed breadth-first-search algorithm. Information Processing Letters, 25:329–333, 1987.

CHAPTER 5

Distributed Set Operations

5.1 INTRODUCTION

In a distributed computing environment, each entity has its own data stored in its local memory. Some data items held by one entity are sometimes related to items held by other entities, and we focus and operate on them. An example is the set of the ids of the entities. What we did in the past was to operate on this set, for example, by finding the smallest id or the largest one. Another example is the set of the single values held by each entity, and the operation was to find the overall rank of each of those values. In all these examples, the relevant data held by an entity consist of just a single data item. In general, an entity x has a set of relevant data $D_x$. The union of all these local sets forms a distributed set of data

\[
D = \bigcup_{x} D_x \tag{5.1}
\]

and the tuple $\langle D_{x_1}, D_{x_2}, \ldots, D_{x_n} \rangle$ describes the distribution of D among the entities $x_1, x_2, \ldots, x_n$. Clearly there are many different distributions of the same distributed set.

There are two main types of operations that can be performed on a distributed set:

1. queries and
2. updates.

A query is a request for some information about the global data set D, as well as about the individual sets $D_x$ forming D. A query can originate at any entity. If the entity where the query originates has the desired information locally, the query can be answered immediately; otherwise, the entity will have to communicate with other entities to obtain the desired information. As usual, when dealing with answering a query we are concerned with the communication costs rather than the local processing costs.


An update is a request to change the composition of the distributed set. There are two basic updates: the request to add a new element to the set, an operation called insertion; and the request to remove an element from the set, an operation called deletion. A third update is the request to change the value of an existing item of the set, an operation called change. Note that a change can be seen as a deletion of the item with the old value followed by an insertion of an item with the new value.

There are many distributions of the same set. In a distribution, the local sets are not necessarily distinct or disjoint. Two extreme cases serve to illustrate the spectrum of distributions and the impact that the structure of the distribution has when handling queries and performing updates. One extreme distribution is the partition, where the local sets have no elements in common:

\[
D_i \cap D_j = \emptyset, \qquad i \ne j.
\]

At the other end of the spectrum is the multiple-copy distribution, where every entity has a copy of the entire data set:

\[
\forall i \quad D_i = D.
\]

A multiple-copy distribution is excellent for queries but poor for updates. Queries are easy because all entities possess all the data; hence every answer can be derived locally, without any communication. However, an update will require modification of the data held at each and every entity; in the presence of concurrent updates, this process becomes exceedingly difficult.

The situation is reversed in the partition. As each data item is located at only one site, answering a query requires searching through all potential entities to find the one that has locally stored the required data. By contrast, performing an update is easy because the change is performed only at the entity having the item, and there is no danger of concurrent updates on the same item.

In most cases, the data are partially replicated; that is, some data items are stored at more than one entity while others are to be found at only one entity. This means that, in general, we have to face and deal with the problems of both extremes, partition and multiple-copy distributions, without the advantages of either one.

In the following we will first focus on an important class of queries, called order statistics; the problem of answering such queries is traditionally called selection. As selection, like most queries, is more easily and efficiently solved if the distribution is sorted, we will also investigate the problem of sorting the distributed data. We will then concentrate on distributed set operations, that is, computing union, intersection, and differences of the local sets. The ability to perform such operations has a direct impact on the processing of complex queries usually performed in databases.

To focus on the problems, we will assume the standard set of restrictions IR (Connectivity, Total Reliability, Bidirectional Links, Distinct Identifiers). For simplicity, as local processing time does not interest us when we consider the cost of our protocols, we will assume that all of the data stored at an entity are sorted.


IMPORTANT. As we consider arbitrary distributions of the data set, it is possible that a data item a is in more than one local set. As we assume ID, we can use the ids of the entities to break ties and create a total order even among copies of the same value; so, for example, if a is in both $D_x$ and $D_y$ where id(x) > id(y), then we can say that the copy of a in $D_x$ is "greater" than the one in $D_y$. In this way, if so desired, the copies can also be considered distinct and included in the global data set D by the union operation (5.1).

5.2 DISTRIBUTED SELECTION

5.2.1 Order Statistics

Given a totally ordered data set D of size N distributed among the entities, the distributed selection problem is the general problem of locating D[K], the Kth smallest element of D. Problems of this type are called order statistics, to distinguish them from the more conventional cardinal statistics (e.g., average, standard deviation, etc.). Unlike cardinal statistics, ordinal ones are more difficult to compute in a distributed environment.

We have already seen and examined the problem of computing D[1] (i.e., the minimum value) and D[N] (i.e., the maximum value). Other elements whose ranks are of particular importance are the medians of the data set. If N is odd, there is only one median, D[⌈N/2⌉]. If N is even, there are two medians: the lower median D[N/2] and the upper median D[N/2 + 1]. Unlike the case of D[1] and D[N], the problem of finding the median(s), and of K-selection for an arbitrary value of K, is not simple and is considerably more expensive to resolve. The complexity of the problem depends on many parameters, including the number n of entities, the size N = |D| of the set, the number $n_x = |D_x|$ of elements stored at an entity x, the rank K of the element being sought, and the topology of the network.

Before proceeding to examine strategies for its solution, let us introduce a fundamental property and a basic observation that will be helpful in our designs.
Let $\overleftarrow{D}[K]$ denote the Kth largest element of the data set. Then

Property 5.2.1 $D[K] = \overleftarrow{D}[N - K + 1]$.

Thus looking for the Kth smallest is the same as looking for the (N − K + 1)th largest. Consider, for example, a set of 10 distinct elements; the 4th smallest is clearly the 7th largest; see Figure 5.1, where the elements $d_1, \ldots, d_{10}$ of the set are represented and sorted in increasing order. This fact has many important consequences, as we will see later.

The other useful tool is based on the following simple observation.

Property 5.2.2 $D_x[K + 1] > D[K] > \overleftarrow{D}_x[N - K + 2]$.

This means that, if an entity x has more than K items, it needs only to consider the smallest K items. Similarly, if x has more than (N − K + 1) items, it needs only to consider the largest (N − K + 1) items.


FIGURE 5.1: The Kth smallest is precisely the (N − K + 1)th largest. [The elements d1, ..., d10 of D are shown in increasing order; K is counted from the left and N − K + 1 from the right.]
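As a small illustration of Properties 5.2.1 and 5.2.2, the following Python sketch checks both on a toy data set. The function names (`kth_smallest`, `kth_largest`, `contract`) and the sample data are ours and purely illustrative; local sets are assumed to be sorted lists of distinct values.

```python
def kth_smallest(items, k):
    """Return the k-th smallest element (1-based) of a collection."""
    return sorted(items)[k - 1]


def kth_largest(items, k):
    """Return the k-th largest element (1-based) of a collection."""
    return sorted(items, reverse=True)[k - 1]


def contract(local_sorted, k, n_total):
    """Keep only the local items that can still be D[k] (Property 5.2.2).

    Items above the local k-th smallest have at least k smaller items,
    and items below the local (n_total - k + 1)-th largest have at least
    n_total - k + 1 larger items, so neither group can contain D[k].
    """
    hi = min(len(local_sorted), k)                      # keep the k smallest
    lo = max(0, len(local_sorted) - (n_total - k + 1))  # keep the N-k+1 largest
    return local_sorted[lo:hi]


# Toy example (illustrative data): D = {1, ..., 10} split over two entities.
dx = [1, 2, 3, 5, 8, 9, 10]
dy = [4, 6, 7]
N = len(dx) + len(dy)
K = 4

# Property 5.2.1: the 4th smallest is the 7th largest.
same = kth_smallest(dx + dy, K) == kth_largest(dx + dy, N - K + 1)

# Property 5.2.2: x has more than K items, so it keeps only its K smallest.
kept_x = contract(dx, K, N)
kept_y = contract(dy, K, N)
```

Here only items larger than D[K] are dropped, so D[K] keeps rank K among the surviving items.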

Finally, we will assume that the selection process will be coordinated by a single entity and that all communication will take place on a spanning tree of the network. Although it does not matter for the correctness of our protocols which entity is selected as coordinator and which spanning tree is chosen for communication, for efficiency reasons it is convenient to choose as coordinator a communication center s of the network and to choose as spanning tree the shortest path spanning tree PT(s) for s. Recall (Section 2.6.6) that a communication center s is a node that minimizes the sum of the distances to all other nodes (i.e., $\sum_{v} d_G(v, s)$ is minimum). Also recall (Section 4.2.3) that, by definition of the shortest path spanning tree, PT(s) is such that $d_G(v, s) = d_{PT(s)}(v, s)$ for all entities v. In the following we will assume that s is used as coordinator, and for simplicity we will denote PT(s) simply as T.

5.2.2 Selection in a Small Data Set

We will first consider the selection problem when the data set is rather small; more precisely, we consider data sets where N = O(n). A special instance of a small distributed set is when every $D_x$ is a singleton: it contains just a single element $d_x$; this is, for example, the case when the only data available at a node is its id.

Input Collection. As the data set is small, the simple solution of collecting all the data at the coordinator and letting s solve the problem locally is actually not infeasible from a complexity point of view. The cost of collecting all the data items at s is clearly $\sum_{v} d_G(v, s)$. To this, we must add an initial broadcast to notify the entities to send their data to the coordinator, and (if needed) a final broadcast to notify them of the final result; as these are done on a tree, their cost will be 2(n − 1) messages. Hence the total cost of this protocol, which we can call Collect, is

$$M[\mathrm{Collect}] = \sum_{v} d_G(v, s) + 2(n - 1) \tag{5.2}$$
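To make the cost (5.2) concrete, here is a minimal Python sketch that evaluates it on two small networks. It assumes the singleton case (one data item per entity), an unweighted network given as an adjacency dictionary, and a coordinator s chosen by the caller; the helper names `bfs_distances`, `collect_cost`, `ring`, and `complete` are ours, not part of the protocol.

```python
from collections import deque


def bfs_distances(adj, s):
    """Hop distances d_G(v, s) from s to every node of a connected graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def collect_cost(adj, s):
    """Messages used by Collect: sum of d_G(v, s) plus two broadcasts on a tree."""
    dist = bfs_distances(adj, s)
    n = len(adj)
    return sum(dist.values()) + 2 * (n - 1)


def ring(n):
    """Adjacency dictionary of a ring with nodes 0, ..., n-1."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def complete(n):
    """Adjacency dictionary of a complete graph with nodes 0, ..., n-1."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


cost_ring = collect_cost(ring(6), 0)          # distances 1, 1, 2, 2, 3
cost_complete = collect_cost(complete(6), 0)  # every distance is 1
```

In a ring or a complete graph every node is a communication center, so the choice s = 0 is without loss of generality in these two examples.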


Notice that, depending on the network,

$$n - 1 \;\le\; \sum_{v} d_G(v, s) \;\le\; \left\lceil \frac{n}{2} \right\rceil \left\lceil \frac{n-1}{2} \right\rceil$$

where the lower bound is achieved, for example, when G is a complete graph, and the upper bound is achieved, for example, when G is a ring. So $M[\mathrm{Collect}] = O(n^2)$ in the worst case. This approach is somewhat of an overkill, as the entire set is collected at s.

Truncated Ranking. It might be possible to reduce the number of messages by making it dependent on the value of K. In fact, we can use the existing ranking protocol for trees (Exercise 2.9.4) and execute it on T until the Kth smallest item is found. The use of the ranking algorithm will then cost no more than

$$\sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s).$$

Note that, if K > N − K + 1, we can exploit Property 5.2.1 and use the ranking algorithm to assign ranks in decreasing order until the (N − K + 1)th largest element is ranked. In this case, the cost will be no more than

$$\sum_{\mathrm{Rank}(v) \ge K} 2\, d_G(v, s).$$

To this we must add the initial broadcast to set up the ranking and a final broadcast to notify the entities of the final result; as these are done on a tree, their cost will be 2(n − 1) messages. Hence, assuming K ≤ N − K + 1, the total cost of this protocol, which we can call Rank, is

$$M[\mathrm{Rank}] \le \sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s) + 2(n - 1). \tag{5.3}$$

Notice that, depending on the network,

$$2(K - 1) \;\le\; \sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s) \;\le\; \frac{K}{2}\left(n - \frac{K}{2} + 1\right)$$

where the lower bound is achieved, for example, when G is a complete graph, and the upper bound could be achieved, for example, when G is a ring. This means that, in any case, M[Rank] ≤ nΔ, where Δ = min{K, N − K + 1}. In other words, if K (or N − K + 1) is small, Rank will be much more efficient than Collect. As K becomes larger, the cost increases until, when K = N/2, the two protocols have the same cost.


IMPORTANT. The protocols we have seen are generic, in that they apply to any topology. For particular networks, it is possible to take advantage of the properties of the topology so as to obtain a more efficient selection protocol. This is the case of the ring (Exercise 5.6.1), the mesh (Exercise 5.6.2), and the complete binary tree (Exercise 5.6.3). The problem of designing a selection protocol that uses $o(n^2)$ messages in the worst case is still unsolved (Problem 5.6.1).

5.2.3 Simple Case: Selection Among Two Sites

In the previous section we have seen how to perform selection when the number of data items is small: N = O(n). In general, this is not the case; in fact, N is usually larger than n by orders of magnitude. So, in general, the techniques that we have seen so far are clearly not efficient. What we need is a different strategy to deal with the general case, in particular when N >> n.

In this section we will examine this problem in the simple setting where n = 2; that is, there are only two entities in the system, x and y. We will develop efficient solution strategies; some of the insights will be useful when faced with the more general case in later sections.

Median. Let us consider first the problem of determining the lower median, that is, D[N/2]. Recall that this is the unique element that has exactly N/2 − 1 elements smaller than itself and exactly N/2 elements larger than itself.

A simple solution is the following. First of all, one of the entities (e.g., the one where the selection query originates, or the one with the smallest id) is elected; it will receive the entire set of the other entity. The elected entity, say x, will then locally determine the median of the set $D_x \cup D_y$ and communicate it, if necessary, to the other entity. Notice that, as x now has the entire data set locally available, it can answer any selection query, not just the one for the lower median.
The drawback of this solution is that the amount of communication is significant, as an entire local set is transferred. We can obviously elect the entity with the larger set to minimize the number of messages; still, O(N) messages must be transferred in the worst case. A more efficient technique is clearly needed.

We can design such a technique on the basis of a simple observation: if we compare the medians of the two local sets, then we can immediately eliminate almost half of the elements from consideration. Let us see why and how.

Assume for simplicity that each local set contains $N/2 = 2^{p-1}$ elements; this means that both $D_x$ and $D_y$ have a lower median, $m_x = D_x[2^{p-2}]$ and $m_y = D_y[2^{p-2}]$ respectively. The overall lower median will have exactly $N/2 - 1 = 2^{p-1} - 1$ elements smaller than itself and exactly $N/2 = 2^{p-1}$ elements larger than itself. For example, consider the two sets of size N/2 = 16 shown in Figure 5.2(a), where each black circle indicates a data element, and in each set the elements are shown locally sorted in a left-to-right increasing order; then $m_x = D_x[8]$ and $m_y = D_y[8]$.

Assume that $m_x > m_y$; then each element in $D_x$ larger than $m_x$ must also be larger than $m_y$. This means that each of them is larger than at least $2^{p-2}$ elements in $D_x$ and at least $2^{p-2}$ elements in $D_y$; that is, it has at least $2^{p-2} + 2^{p-2} = 2^{p-1} = N/2$ elements smaller than itself, and therefore it cannot be the lower median. In other words, any element larger than the larger of the two medians can be discounted from consideration, as it is larger than the overall median. See Figure 5.2(b).

Similarly, all the elements in $D_y$ smaller than $m_y$ can be discounted as well. In fact, each such element is smaller than at least $2^{p-2}$ elements in its own set and at least $2^{p-2} + 1$ elements in the other set; that is, it has at least $2 \cdot 2^{p-2} + 1 = 2^{p-1} + 1 = N/2 + 1$ elements larger than itself, and therefore it cannot be the lower median. See Figure 5.2(b).

Thus, by locally calculating and then exchanging the median of each set, about half of the elements of each set, and therefore about half of the total number of elements, can be discounted; they are shown as white circles in Figure 5.2(c).

FIGURE 5.2: Half of the elements can be discarded after a single comparison of the two local medians. [(a) the two sorted sets, with $m_x > m_y$; (b) the elements of $D_x$ larger than $m_x$ are too large and those of $D_y$ smaller than $m_y$ are too small; (c) the elements still under consideration.]
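The elimination argument above can be exercised with a short Python sketch that repeats the comparison of the two local lower medians until one element is left at each site. This is only a local simulation under the same simplifying assumptions as the text (two sorted sets of equal power-of-two size, distinct values); no messages are modelled, and the function name `halving_lower_median` is ours. As a small variant, the sketch also drops the smaller of the two medians: by the same counting argument it has at most N/2 − 2 smaller elements and so cannot have rank N/2, and dropping it keeps the two surviving sets of equal size.

```python
def halving_lower_median(dx, dy):
    """Lower median of dx + dy by repeated comparison of local lower medians.

    dx, dy: sorted lists of distinct values, same power-of-two length.
    Returns (median, iterations).
    """
    assert len(dx) == len(dy) and len(dx) > 0
    assert len(dx) & (len(dx) - 1) == 0, "size must be a power of two"
    x, y = list(dx), list(dy)
    iterations = 0
    while len(x) > 1:
        half = len(x) // 2
        mx, my = x[half - 1], y[half - 1]  # local lower medians
        iterations += 1
        if mx > my:
            # Elements of x above mx are too large; elements of y up to
            # and including my are too small.
            x, y = x[:half], y[half:]
        else:
            x, y = x[half:], y[:half]
    # One element per site: the lower median of two items is the smaller.
    return min(x[0], y[0]), iterations


median, rounds = halving_lower_median([1, 4, 6, 7], [2, 3, 5, 8])
```

With two sets of size $2^{p-1}$ the loop runs p − 1 times, in line with the logarithmic number of iterations of the protocol.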


There is a very interesting and important property (Exercise 5.6.4): the overall lower median is the lower median of the elements still under consideration. This means that we can reapply the same process to the elements still under consideration: the entities communicate to each other the lower median of the local elements under consideration, these are compared, and half of these data are removed from consideration.

In other words, we have just designed a protocol, which we shall call Halving, that is composed of a sequence of iterations; in each, half of the elements still under consideration are discarded and the sought global median is still the median of the considered data; this process is repeated until only a single element is left at each site and the median can be unambiguously determined. As we halve the problem size at every iteration, the total number of iterations is log N. Each iteration requires the communication of the local lower medians (of the elements still under consideration), a task that can be accomplished using just one message per iteration.

The working of the protocol has been described assuming that N is a power of two and that both sets have the same number N/2 of elements. Fortunately, these two assumptions are not essential. In fact, the protocol Halving can be adjusted to two arbitrarily sized sets without changing its complexity (Exercise 5.6.5).

Arbitrary K. We have just seen a simple and efficient protocol for finding the overall (lower) median D[⌈N/2⌉] of a set D distributed over two sites. Let us consider the general problem of selecting D[K], the Kth smallest element of D, when K is arbitrary, 1 ≤ K ≤ N. Assume again, for simplicity, that the two sets have the same size N/2. We know already how to deal with the case of K = N/2.

Case K < N/2. Consider first the case when K < N/2. This means that each of the two sites has locally more than K elements. An example with N/2 = 12 and K = 4 is shown in Figure 5.3. Consider the set $D_x$.
As we are looking for the Kth smallest data item overall, any data item greater than $D_x[K]$ cannot be D[K] (as it will be larger than at least K data items). This means that we can immediately discount all these items, keeping only K items still under consideration. For example, in Figure 5.3(a) we have N/2 = 12 items shown in a left-to-right increasing order; if K = 4, then all the items greater than $D_x[4]$ are too large to be D[4]; see Figure 5.3(b). Similarly, we can keep under consideration in $D_y$ just $D_y[K]$ and the items that are smaller.

IMPORTANT. Notice that D[K] is also the Kth smallest item among those kept in consideration; this is because we have discounted only elements larger than D[K].

What is the net result of this? We are now left with two sets of items, each of size K; see Figure 5.3(c). Among those items, we are looking for the Kth smallest element. In other words, once this operation has been performed, the problem we need to solve is to determine the lower median of the elements under consideration. We already know how to solve this problem efficiently.

FIGURE 5.3: All the elements greater than the local Kth smallest element can be discarded. [(a) the two sorted sets; (b) the items greater than $D_x[K]$ and greater than $D_y[K]$ are too large; (c) the items still under consideration.]

Thus, if K < N/2 we can reduce the problem to that of finding the lower median. Notice that this is accomplished without any communication, once it is known that we are looking for D[K].

Case K > N/2. Consider next the case when K > N/2. This means that each of the two sites has locally fewer than K elements; thus we cannot use the approach we used for K < N/2. Still, we can make a similar reduction also in this case. To see how and why, consider the following obvious but important property of any totally ordered set.


Looking for the Kth smallest is the same as looking for the (N − K + 1)th largest.

This fact has an important practical consequence. First of all, observe that if K > N/2 then N − K + 1 ≤ N/2. Further observe that the (N − K + 1)th largest item is the only one that has exactly N − K items larger than itself and exactly K − 1 items smaller than itself.

Consider $D_x$. As we are looking for the (N − K + 1)th largest data item overall, any data item smaller than $\overleftarrow{D}_x[N - K + 1]$, the local (N − K + 1)th largest item of $D_x$, cannot be D[K] (as there are at least N − K + 1 larger data items). This means that we can immediately discount all these items, keeping only N − K + 1 items still under consideration. For example, in Figure 5.4(a) we have N = 24 items equidistributed between the two sites, whose items are shown in a left-to-right increasing order. If K = 21, then N − K + 1 = 4; that is, we are looking for the 4th largest item overall; then all the items smaller than the 4th largest in $D_x$, that is, smaller than $\overleftarrow{D}_x[4]$, are too small to be $D[21] = \overleftarrow{D}[4]$; see Figure 5.4(b). Similarly, we can keep under consideration in $D_y$ just $\overleftarrow{D}_y[N - K + 1]$ and the items that are larger.

FIGURE 5.4: All the data items smaller than the local (N − K + 1)th largest element can be discarded. [(a) the two sorted sets; (b) the items of $D_x$ smaller than its 4th largest item are too small; (c) the items still under consideration.]
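Both reductions (keep the K smallest items at each site when K is small, or the N − K + 1 largest when K is large) can be summarized in a few lines of Python. This is a sketch under the assumptions of the text: two sorted local sets of the same size N/2 with distinct values. The function name `reduce_to_median` and the sample data are ours; the third returned value is the rank, counted from the smallest, of D[K] among the surviving items.

```python
def reduce_to_median(dx, dy, k):
    """Reduce K-selection over two equal-size sorted sets to median finding.

    Returns (kept_x, kept_y, rank), where rank is the position (1-based,
    from the smallest) of D[k] among the kept items.
    """
    half = len(dx)            # N/2
    n_total = 2 * half        # N
    if k <= half:
        # Keep the k smallest at each site; D[k] is the lower median
        # (rank k) of the 2k surviving items.
        return dx[:k], dy[:k], k
    # k > N/2: keep the m = N - k + 1 largest at each site; D[k] is the
    # m-th largest, i.e. the upper median (rank m + 1), of the 2m items.
    m = n_total - k + 1
    return dx[half - m:], dy[half - m:], m + 1


# Illustrative data: N = 24 items, 12 per site, as in Figures 5.3 and 5.4.
dx = list(range(1, 24, 2))   # 1, 3, ..., 23
dy = list(range(2, 25, 2))   # 2, 4, ..., 24

small = reduce_to_median(dx, dy, 4)    # K = 4 < N/2
large = reduce_to_median(dx, dy, 21)   # K = 21 > N/2, N - K + 1 = 4
```

In both cases no communication is needed: each site performs the cut locally once K and N are known.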


IMPORTANT. Notice that D[K] is the (N − K + 1)th largest item among those kept in consideration; this is because we have discounted only elements smaller than D[K].

What is the net result of this? We are now left with two sets of items, each of size N − K + 1; see Figure 5.4(c). Among those items, we are looking for the (N − K + 1)th largest element. In other words, once this operation has been performed, the problem we need to solve is to determine the upper median of the elements under consideration. We already know how to solve this problem efficiently.

Summary. Regardless of the value of K, we can always transform the K-selection problem into a median-finding problem. Notice that this is accomplished without any additional communication, once it is known that we are looking for D[K]. In the description we have assumed that both sites have the same number of elements, N/2. If this is not the case, it is easy to verify (Exercise 5.6.6) that the same type of reduction can still take place.

Hacking. As we have seen, median finding is "the" core problem to solve. Our solution, Halving, is efficient. This protocol can be made more efficient by observing that (assuming $m_x > m_y$) we can discard any element greater than $m_x$ (because it is too large to be the median) not only from $D_x$ but also from $D_y$ (if there are any); similarly, we can discard the elements smaller than $m_y$ (because they are too small to be the median) not only from $D_y$ but also from $D_x$ (if there are any). In this way we can reduce the number of elements still under consideration by more than half, thus possibly reducing the number of iterations.

CAUTION: The number of discarded items that are greater than the median might be larger than the number of discarded items that are smaller than the median (or vice versa). This means that the overall lower median we are looking for is no longer the median of the elements left under consideration.
In other words, after removing items from consideration, we might be left with a general selection problem. By now, we know how to reduce a selection problem to the median-finding one. The resulting protocol, which we shall call GeneralHalving, will use a few more messages in each iteration but might yield a larger reduction (Exercise 5.6.7).

Generalization. This technique can be generalized to three sites; however, we are no longer able to reduce the number of items still under consideration to at most half at each iteration (Exercise 5.6.9). For n > 3 the technique we have designed for two sites is unfortunately no longer efficiently scalable. Fortunately, some lessons we have learned when dealing with the two sites are immediately and usefully applicable to any n, as we will discuss in the next section.

5.2.4 General Selection Strategy: RankSelect

In the previous sections we have seen how to perform selection when the number of data items is small or there are only two sites. In general, this is not the case. For example, in most practical applications, the number of sites is 10–100, while the amount of data at each site is ≥ $10^6$. What we need is a different strategy to deal with the general case.

Let us think of the set D containing the N elements as a search space in which we need to find d* = D[K], unknown to us; the only thing we know about d* is its rank Rank[d*, D] = K. An effective way to handle the problem of discovering d* is to reduce the search space as much as possible, eliminating from consideration as many items as possible, until we find d* or the search space is small enough (e.g., O(n)) for us to apply the techniques discussed in the previous sections.

Suppose that we (somehow) know the rank Rank[d, D] of a data item d in D. If Rank[d, D] = K then d is the element we were looking for. If Rank[d, D] < K then d is too small to be d*, and so are all the items smaller than d. Similarly, if Rank[d, D] > K, then d is too large to be d*, and so are all the items larger than d. This fact can be employed to design a simple and, as we will see, rather efficient selection strategy:

Strategy RankSelect:
1. Among the data items under consideration (initially, they all are), choose one, say d.
2. Determine its overall rank k = Rank[d, D].
3. If k = K then d = d* and we are done. Else, if k < K (respectively, k > K), remove from consideration d and all the data items smaller (respectively, larger) than d, and restart the process.

Thus, according to this strategy, the selection process consists of a sequence of iterations, each reducing the search space, performed until d* is found. Notice that we could stop the process as soon as just a few data items (e.g., O(n)) are left for consideration, and then apply protocol Rank. Most of the operations performed by this strategy are rather simple to implement.
We can assume that a spanning tree of the network is available and will be used for all communication, and that an entity is elected to coordinate the overall execution (becoming the root of the tree for this protocol). Any entity can act as coordinator and any spanning tree T of the network will do. However, for efficiency reasons, it is better to choose as coordinator the communication center s of the network, and to choose as tree T the shortest path spanning tree PT(s) of s.

Let d(i) be the item selected at the beginning of iteration i. Once d(i) is chosen, the determination of its rank requires just a broadcast (to let every entity know d(i)) started by the root s and a convergecast (to collect the partial rank information) ending at the root s (recall Exercise 2.9.43). Once s has determined the rank of d(i), it will notify all other entities of the result: d(i) = d*, d(i) < d*, or d(i) > d*; each entity will then act accordingly (terminating or removing some elements from consideration).
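One iteration of the strategy, for a given choice of d(i), might be sketched as follows in Python. The rooted tree is represented by a hypothetical `tree` dictionary mapping each node to its children, `data` holds the full local sets, and `candidates` the items still under consideration; these names, like the functions `partial_rank` and `rank_select_iteration`, are illustrative assumptions, and the recursion simply stands in for the convergecast.

```python
def partial_rank(tree, data, node, d):
    """Convergecast: number of items <= d stored in the subtree of node."""
    count = sum(1 for a in data[node] if a <= d)  # local contribution
    for child in tree.get(node, []):
        count += partial_rank(tree, data, child, d)
    return count


def rank_select_iteration(tree, data, candidates, root, d, K):
    """Compare Rank[d, D] with K and prune the search space accordingly.

    Returns (verdict, new_candidates); verdict is what the root would
    broadcast to all entities at the end of the iteration.
    """
    k = partial_rank(tree, data, root, d)  # Rank[d, D] known at the root
    if k == K:
        return "found", candidates
    if k < K:
        # d and everything smaller than d is too small to be d*.
        pruned = {x: [a for a in items if a > d]
                  for x, items in candidates.items()}
        return "too small", pruned
    # d and everything larger than d is too large to be d*.
    pruned = {x: [a for a in items if a < d]
              for x, items in candidates.items()}
    return "too large", pruned


# Illustrative network: root s with children a and b; a has child c.
tree = {"s": ["a", "b"], "a": ["c"]}
data = {"s": [5, 12], "a": [3, 9], "b": [1, 14], "c": [7, 10]}

verdict, remaining = rank_select_iteration(tree, data, data, "s", 9, 3)
```

In this example D[3] = 5, so the choice d = 9 (of rank 5) is reported as too large and every entity drops its items that are not smaller than 9.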


The only operation still to be discussed is how we choose d(i). The choice of d(i) is quite important because it affects the number of iterations and thus the overall complexity of the resulting protocol. Let us examine some of the possible choices and their impact.

Random Choice. We can choose d(i) uniformly at random; that is, in such a way that each item of the search space has the same probability of being chosen.

How can s choose d(i) uniformly at random? In Section 2.6.7 and Exercise 2.9.52 we have discussed how to select, in a tree, uniformly at random an item from the initial distributed set. Clearly that protocol can be used to choose d(i) in the first iteration of our algorithm. However, we cannot immediately use it in the subsequent iterations. In fact, after an iteration, some items are removed from consideration; that is, the search space is reduced. This means that, for the next iteration, we must ensure we select an item that is still in the new search space. Fortunately, this can be achieved with simple readjustments to the protocol of Exercise 2.9.52, achieving the same cost in each iteration (Exercise 5.6.10). That is, each iteration costs at most $2(n - 1) + d_T(s, x)$ messages and $2r(s) + d_T(s, x)$ ideal time units for the random selection, plus an additional 2(n − 1) messages and 2r(s) time units to determine the rank of the selected element.

Let us call the resulting protocol RandomSelect. To determine its global cost, we need to determine the number of iterations. In the worst case, in iteration i we remove from the search space only d(i); so the number of iterations can be as bad as N, for a worst case cost of

$$M[\mathrm{RandomSelect}] \le (4(n - 1) + r(s))\, N, \tag{5.4}$$

$$T[\mathrm{RandomSelect}] \le 5\, r(s)\, N. \tag{5.5}$$

However, on the average, the power of making a random choice is evident; in fact (Exercise 5.6.11):

Lemma 5.2.1 The expected number of iterations performed by Protocol RandomSelect until termination is at most 1.387 log N + O(1).

This means that, on the average,

$$M_{\mathrm{average}}[\mathrm{RandomSelect}] = O(n \log N), \tag{5.6}$$

$$T_{\mathrm{average}}[\mathrm{RandomSelect}] = O(n \log N). \tag{5.7}$$

As mentioned earlier, we could stop the strategy RankSelect, and thus terminate protocol RandomSelect, as soon as O(n) data items are left for consideration, and then apply protocol Rank. See Exercise 5.6.12.
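The behaviour of the random choice can be explored with a small centralized simulation. The Python sketch below is not the distributed protocol: the network, the broadcast, and the convergecast are replaced by plain loops, and the uniform choice is obtained by first drawing an entity with probability proportional to its number of surviving items. The function name `random_select`, the `seed` parameter, and the sample data are our own illustrative choices.

```python
import random


def random_select(local_sets, K, seed=0):
    """Simulate strategy RankSelect with a uniformly random choice of d(i).

    local_sets: one list of distinct items per entity; 1 <= K <= N.
    Returns (d_star, iterations).
    """
    rng = random.Random(seed)
    candidates = [list(items) for items in local_sets]
    iterations = 0
    while True:
        iterations += 1
        # Uniform choice over the search space: pick an entity with
        # probability proportional to its number of candidates, then
        # one of its candidates uniformly.
        sizes = [len(c) for c in candidates]
        x = rng.choices(range(len(candidates)), weights=sizes)[0]
        d = rng.choice(candidates[x])
        # Rank[d, D]: each entity counts its items <= d, the root adds up.
        rank = sum(sum(1 for a in items if a <= d) for items in local_sets)
        if rank == K:
            return d, iterations
        if rank < K:
            candidates = [[a for a in c if a > d] for c in candidates]
        else:
            candidates = [[a for a in c if a < d] for c in candidates]


# Illustrative data: N = 9 items over n = 3 entities.
sets = [[7, 1, 9], [4, 8], [2, 6, 3, 5]]
d_star, its = random_select(sets, K=4)
```

Since d* is never removed and each iteration discards at least the chosen item, the loop performs at most N iterations; averaging `its` over many seeds gives an empirical feel for Lemma 5.2.1.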


Random Choice with Reduction. We can improve the average message complexity by exploiting the properties discussed in Section 5.2.1. Let Δ(i) = min{K(i), N(i) − K(i) + 1}. In fact, by Property 5.2.2, if at the beginning of iteration i an entity has more than K(i) elements under consideration, it needs to consider only the K(i) smallest and can immediately remove the others from consideration; similarly, if it has more than N(i) − K(i) + 1 items, it needs to consider only the N(i) − K(i) + 1 largest and can immediately remove the others from consideration.

If every entity does this, the search space can be further reduced even before the random selection process takes place. In fact, the net effect of the application of this technique is that each entity will have at most Δ(i) = min{K(i), N(i) − K(i) + 1} items still under consideration during iteration i. The root s can then perform random selection in this reduced space of size n(i) ≤ N(i). Notice that d* will have a new rank k(i) ≤ K(i) in the new search space.

Specifically, our strategy will be to include, in the broadcast started by the root s at the beginning of iteration i, the values N(i) and K(i). Each entity, upon receiving this information, will locally perform the reduction (if any) of the local elements and then include in the convergecast the information about the size of the new search space. At the end of the convergecast, s knows both n(i) and k(i) as well as all the information necessary to perform the random selection in the reduced search space. In other words, the total number of messages per iteration will be exactly the same as that of Protocol RandomSelect.

In the worst case this change does not make any difference. In fact, for the resulting protocol RandomFlipSelect, the number of iterations can still be as bad as N (Exercise 5.6.13), for a worst case cost of

$$M[\mathrm{RandomFlipSelect}] \le (2(n - 1) + r(s))\, N, \tag{5.8}$$

$$T[\mathrm{RandomFlipSelect}] \le 3\, r(s)\, N. \tag{5.9}$$

The change does, however, make a difference on the average cost. In fact (Exercise 5.6.14):

Lemma 5.2.2 The expected number of iterations performed by Protocol RandomFlipSelect until termination is less than ln(Δ) + ln(n) + O(1),

where ln() denotes the natural logarithm (recall that ln() = .693 log()). This means that, on the average,

$$M_{\mathrm{average}}[\mathrm{RandomFlipSelect}] = O(n\,(\ln(\Delta) + \ln(n))), \tag{5.10}$$

$$T_{\mathrm{average}}[\mathrm{RandomFlipSelect}] = O(n\,(\ln(\Delta) + \ln(n))). \tag{5.11}$$


Also in this case, we could stop the strategy RankSelect, and thus terminate protocol RandomFlipSelect, as soon as only O(n) data items are left for consideration, and then apply protocol Rank. See Exercise 5.6.15.

Selection in a Random Distribution. So far, we have not made any assumption on the distribution of the data items among the entities. If we know something about how the data are distributed, we can clearly exploit this knowledge to design a more efficient protocol. In this section we consider a very simple and quite reasonable assumption about how the data are distributed.

Consider the set D; it is distributed among the entities $x_1, \ldots, x_n$; let $n[x_j] = |D_{x_j}|$ be the number of items stored at $x_j$. The assumption we will make is that all the distributions of D that end up with $n[x_j]$ items at $x_j$, 1 ≤ j ≤ n, are equally likely.

In this case we can refine the selection of d(i). Let z(i) be the entity where the number of elements still under consideration in iteration i is the largest; that is, $m(i) = |D_{z(i)}(i)| \ge |D_x(i)|$ for all x. (If there is more than one entity with the same number of items, choose an arbitrary one.) In our protocol, which we shall call RandomRandomSelect, we will choose d(i) to be the h(i)th smallest item in the set $D_{z(i)}(i)$, where

$$h(i) = \left\lceil K(i)\, \frac{m(i) + 1}{N + 1} - \frac{1}{2} \right\rceil.$$

We will use this choice until there are fewer than n items under consideration. At this point, in Protocol RandomRandomSelect we will use Protocol RandomFlipSelect to finish the job and determine d*. Notice that also in this protocol, each iteration can easily be implemented (Exercise 5.6.16) with at most 4(n − 1) + r(s) messages and 5r(s) ideal time units.

With the choice of d(i) we have made, the average number of iterations, until there are fewer than n items left under consideration, is indeed small. In fact (Exercise 5.6.17):

Lemma 5.2.3 Let the randomness assumption hold. Then the expected number of iterations performed by Protocol RandomRandomSelect until there are fewer than n items under consideration is at most $\frac{4}{3} \log\log \Delta + 1$.

This means that, on the average,

$$M_{\mathrm{average}}[\mathrm{RandomRandomSelect}] = O(n\,(\log\log \Delta + \log n)) \tag{5.12}$$

and

$$T_{\mathrm{average}}[\mathrm{RandomRandomSelect}] = O(n\,(\log\log \Delta + \log n)). \tag{5.13}$$

Filtering. The drawback of all previous protocols rests on their worst case costs: O(nN) messages and O(r(s)N) time; notice that this cost is more than that of input collection, that is, of mailing all the items to s. It can be shown that the probability of the occurrence of the worst case is so small that it can be neglected. However, there might be systems where such a cost is not affordable under any circumstances. For these systems, it is necessary to have a selection protocol that, even if less efficient on the average, can guarantee a reasonable cost even in the worst case.

The design of such a protocol is fortunately not so difficult; in fact, it can be achieved with the strategy RankSelect with the appropriate choice of d(i). As before, let $D_x^i$ denote the set of elements still under consideration at x in iteration i and $n_x^i = |D_x^i|$ denote its size. Consider the (lower) median $d_x^i = D_x^i[\lceil n_x^i/2 \rceil]$ of $D_x^i$, and let $M(i) = \{d_x^i\}$ be the set of these medians. With each element in M(i) associate a weight; the weight associated with $d_x^i$ is just the size $n_x^i$ of the corresponding set.

Filter: Choose d(i) to be the weighted (lower) median of M(i).

With this choice, the number of iterations is rather small (Exercise 5.6.18):

Lemma 5.2.4 The number of iterations performed by Protocol Filter until there are no more than n elements left under consideration is at most 2.41 log(N/n).

Once there are at most n elements left under consideration, the problem can be solved using one of the known techniques for small sets, for example, Rank.

However, each iteration requires a complex operation; in fact, we need to find the median of the set M(i) in iteration i. As the set is small (it contains at most n elements), this can be done using, for example, Protocol Rank. In the worst case, it will require $O(n^2)$ messages in each iteration. This means that, in the worst case,

$$M[\mathrm{Filter}] = O\!\left(n^2 \log \frac{N}{n}\right), \tag{5.14}$$

$$T[\mathrm{Filter}] = O\!\left(n \log \frac{N}{n}\right). \tag{5.15}$$
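The Filter choice can be illustrated with the following Python sketch. It assumes one common definition of the weighted lower median, namely the smallest value whose cumulative weight (taking the values in increasing order) reaches at least half of the total weight; this definition, the function names, and the sample data are our assumptions. The computation is done centrally here, whereas in the protocol the medians would be handled with, for example, Protocol Rank.

```python
def lower_median(sorted_items):
    """Lower median of a non-empty sorted list: the ceil(n/2)-th smallest."""
    return sorted_items[(len(sorted_items) + 1) // 2 - 1]


def weighted_lower_median(pairs):
    """Smallest value whose cumulative weight reaches half the total weight.

    pairs: iterable of (value, weight) with positive integer weights.
    """
    pairs = sorted(pairs)
    total = sum(w for _, w in pairs)
    acc = 0
    for value, w in pairs:
        acc += w
        if 2 * acc >= total:
            return value


def filter_choice(candidates):
    """d(i) for Filter: weighted lower median of the local lower medians,
    each weighted by the size of the local set it comes from."""
    pairs = [(lower_median(sorted(c)), len(c)) for c in candidates if c]
    return weighted_lower_median(pairs)


# Illustrative search space: local sizes 6, 2 and 1.
d = filter_choice([[1, 2, 3, 4, 5, 6], [10, 11], [20]])
```

In the example the local medians are 3, 10, and 20 with weights 6, 2, and 1; the first of them already carries more than half of the total weight, so it is the one chosen.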

5.2.5 Reducing the Worst Case: ReduceSelect

The worst case we have obtained by using the Filter choice in strategy RankSelect is reasonable, but it can be reduced using a different strategy. This strategy, and the resulting protocol that we shall call ReduceSelect, is obtained mainly by combining and integrating all the techniques we have developed so far for reducing the search space with new, original ones.

Reduction Tools. Let us first summarize the main basic tool we have used so far.

Reduction Tool 1: Local Contraction. If entity x has more than Δ items under consideration, it can immediately discard any item greater than the local Kth smallest element and any item smaller than the local (N − K + 1)th largest element.


This tool is based on Property 5.2.2. The requirement for the application of this tool is that each site must know K and N. The net effect of the application of this tool is that, afterwards, each site has at most Δ items under consideration stored locally. Recall that we have used this reduction tool already when dealing with the two-site case, as well as in Protocol RandomFlipSelect.

A different type of reduction is offered by the following tool.

Reduction Tool 2: Sites Reduction. If the number of entities n is greater than K (respectively, N − K + 1), then n − K entities (respectively, n − N + K − 1) and all their data items can be removed from consideration. This can be achieved as follows.
1. Consider the set $D_{min} = \{D_x[1]\}$ (respectively, $D_{max} = \{D_x[|D_x|]\}$) of the smallest (respectively, the largest) item at each entity.
2. Find the Kth smallest (respectively, (N − K + 1)th largest) element, call it w, of this set. NOTE: This set has n elements; hence this operation can be performed using protocol Rank.
3. If $D_x[1] > w$ (respectively, $D_x[|D_x|] < w$), then the entire set $D_x$ can be removed from consideration.

This reduction technique immediately reduces the number of sets involved in the problem to at most Δ. For example, consider the case of searching for the 7th largest item when the N data items of D are distributed among n = 10 entities. Consider now the largest element stored at each entity (they form a set of 10 elements), and find the 7th largest of them. The 8th largest element of this set cannot possibly be the 7th largest item of the entire distributed set D; as it is the largest item stored at the entity from which it originated, none of the other items stored at that entity can be the 7th largest element either; so we can remove from consideration the entire set stored at that entity. Similarly, we can also remove the sets where the 9th and the 10th largest came from.

These two tools can obviously be used one after the other.
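The three steps of Sites Reduction can be sketched in Python as follows, in both the "smallest" and the "largest" variants. Local sets are assumed to be non-empty sorted lists of distinct values; the function names and the sample data are ours, and the selection of w, which the protocol would perform with Rank, is done here by a local sort.

```python
def sites_reduction_smallest(local_sets, K):
    """Variant for K < n: drop every set whose minimum exceeds the K-th
    smallest of the local minima."""
    if len(local_sets) <= K:
        return local_sets                        # tool not applicable
    minima = sorted(s[0] for s in local_sets)    # the set D_min
    w = minima[K - 1]                            # K-th smallest of D_min
    return [s for s in local_sets if s[0] <= w]


def sites_reduction_largest(local_sets, m):
    """Variant for m = N - K + 1 < n: drop every set whose maximum is below
    the m-th largest of the local maxima."""
    if len(local_sets) <= m:
        return local_sets                        # tool not applicable
    maxima = sorted((s[-1] for s in local_sets), reverse=True)  # D_max
    w = maxima[m - 1]                            # m-th largest of D_max
    return [s for s in local_sets if s[-1] >= w]


# Illustrative example from the text: 7th largest item, n = 10 entities.
sets = [[10 * i, 10 * i + 1] for i in range(10)]
kept = sites_reduction_largest(sets, 7)
```

In the example the three entities holding the 8th, 9th, and 10th largest of the local maxima are removed, and the 7th largest item of D is unchanged among the seven surviving sets.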
The combined use of these two tools reduces the problem of selection in a search space of size N distributed among n sites to that of selection among min{n, Δ} sites, each with at most Δ elements. This means that, after the execution of these two tools, the new search space contains at most Δ² data items. Notice that, once the tools have been applied, if the size of the search space and/or the rank of f* in that space have changed, it is possible that the two tools can be successfully applied again. For example, consider the case depicted in Table 5.1, where N = 10,032 items are distributed among n = 5 entities, x1, ..., x5, and where we are looking for the Kth smallest element in this set, where K = 4,096. First observe that, when we apply the two Reduction Tools, only the first one (Contraction) will be successful. The effect will be to remove from consideration many elements from x1, all larger than f*. In other words, we have significantly reduced the search space without changing the rank of f* in the search space. If we apply the two Reduction Tools again to the new configuration, again only the first one (Contraction) will be successful; however, this second application will further drastically reduce the size of the search space (the variable N) from 4,128 to 65 and the rank of f* in the new search space (the variable K) from 4,096 to 33.

TABLE 5.1: Repeated use of the Reduction Tools

N: size of search space   K: rank of f* in search space   x1       x2   x3   x4   x5
10,032                    4,096                           10,000   20   5    5    2
4,128                     4,096                           4,096    20   5    5    2
65                        33                              33       20   5    5    2

This fact means that we can iterate Local Contraction until there is no longer any change in the search space and in the rank of f* in the search space. This will occur when, at each site x_i, the number of items still under consideration n'_i is not greater than Δ' = min{K', N' − K' + 1}, where N' is the size of the search space and K' the rank of f* in the search space. We will then use the Sites Reduction tool. The reduction protocol REDUCE based on this repeated use of the two Reduction Tools is shown in Figure 5.5.

Lemma 5.2.5 After the execution of Protocol REDUCE, the number of items left under consideration is at most Δ min{n, Δ}.

The single execution of Sites Reduction requires a selection in a small set, as discussed in Section 5.2.2. Each execution of Local Contraction required by Protocol REDUCE requires a broadcast and a convergecast, and costs 2(n − 1) messages and 2r(s) time. To determine the total cost we need to find out the number of times Local Contraction is executed. Interestingly, this will occur a constant number of times: at most three, to be precise (Exercise 5.6.19).

REDUCE
begin
   N' = N; K' = K; Δ' = Δ; n'_i = n_i, 1 ≤ i ≤ n;
   while ∃ x_i such that n'_i > Δ' do
      perform Local Contraction;
      * update the values of N', K', Δ', n'_i (1 ≤ i ≤ n)
   endwhile
   if n > Δ' then
      perform Sites Reduction;
   endif
end

FIGURE 5.5: Protocol REDUCE.
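The while loop of Protocol REDUCE can be followed on the site sizes alone. The sketch below starts from the sizes in the first row of Table 5.1 and tracks N, K, and the per-site counts through repeated Local Contraction. It rests on one assumption about the tool: when Δ = K each site drops its largest surplus items (so K is unchanged), and when Δ = N − K + 1 < K each site drops its smallest surplus items (so K decreases by the number removed). The names `delta` and `reduce_sizes` are illustrative only.

```python
def delta(N, K):
    """Delta = min{K, N - K + 1}."""
    return min(K, N - K + 1)


def reduce_sizes(sizes, K):
    """Track (N, K, sizes) through repeated Local Contraction (sizes only)."""
    sizes = list(sizes)
    N = sum(sizes)
    history = [(N, K, tuple(sizes))]
    while any(s > delta(N, K) for s in sizes):
        d = delta(N, K)
        removed = sum(max(0, s - d) for s in sizes)
        if d != K:
            # Here d = N - K + 1 < K. Assumption: each site drops its
            # smallest surplus items, all smaller than f*, so the rank
            # of f* in the new search space shifts down accordingly.
            K -= removed
        # Otherwise d = K: the largest surplus items are dropped, K is kept.
        sizes = [min(s, d) for s in sizes]
        N -= removed
        history.append((N, K, tuple(sizes)))
    return history


history = reduce_sizes([10_000, 20, 5, 5, 2], K=4_096)
for N, K, sizes in history:
    print(N, K, sizes)
```

Each printed line corresponds to one row of the table: the size of the search space, the rank of f* in it, and the number of items left at x1, ..., x5.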


Cutting Tools
The new tool we are going to develop is to be used whenever the number n of sets is at most Δ and each entity has at most Δ items; this is, for example, the result of applying Tools 1 and 2 described before. Thus, the search space contains at most Δ² items. For simplicity, and without loss of generality, let K = Δ (the case N − K + 1 = Δ is analogous). To aid in the design, we can visualize the search space as an array D of size n × Δ, where the rows correspond to the sets of items, each set sorted in increasing order, and the columns specify the rank of that element in the set. So, for example, d_{i,j} is the jth smallest item in the set stored at entity x_i. Notice that there is no relationship among the elements of the same column; in other words, D is a matrix with sorted rows but unsorted columns. Each column corresponds to a set of n elements distributed among the n entities. If an element is removed from consideration, it will be represented by +∞ in the corresponding entry in the array.

Consider the set C(2), that is, all the second-smallest items in each site. Focus on the kth smallest element m(2) of this set, where k = ⌈K/2⌉. By definition, m(2) has exactly k − 1 elements smaller than itself in C(2); each of them, as well as m(2), has another item smaller than itself in its own row (this is because they are second-smallest in their own set). This means that, as far as we know, m(2) has at least (k − 1) + k = 2k − 1 ≥ K − 1 items smaller than itself in the global set D; this implies that any item greater than m(2) cannot be the Kth smallest item we are looking for. In other words, if we find m(2), then we can remove from consideration any item larger than m(2).

Similarly, we can consider the set C(2^i), where 2^i ≤ K, composed of the 2^i-th smallest items in each set. Focus again on the kth smallest element m(2^i) of C(2^i), where k = ⌈K/2^i⌉. By definition, m(2^i) has exactly k − 1 elements smaller than itself in C(2^i); each of them, as well as m(2^i), has another 2^i − 1 items smaller than itself in its own row (this is because they are the 2^i-th smallest in their own set). This means that m(2^i) has at least

$$(k - 1) + k\,(2^i - 1) = k\,2^i - 1 \;\ge\; \frac{K}{2^i}\,2^i - 1 = K - 1$$

items smaller than itself in the global set D; this implies that any item greater than m(2^i) cannot be the Kth smallest item we are looking for. In other words, if we find m(2^i), then we can remove from consideration any item larger than m(2^i). Thus, we have a generic Reduction Tool using columns whose index is a power of two.


CUT
begin
   k := ⌈K/2⌉; l := 2;
   while k ≥ log K and search space is not small do
      if in C(l) there are ≥ k items still under consideration then
         * use the Cutting Tool:
         find the kth smallest element m(l) of C(l);
         remove from consideration all the elements greater than m(l).
      endif
      k := ⌈k/2⌉; l := 2l;
   endwhile
end

FIGURE 5.6: Protocol CUT.

Cutting Tool
Let l = 2^i ≤ K and k = ⌈K/l⌉. Find the kth smallest element m(l) of C(l), and remove from consideration all the elements greater than m(l).

The Cutting Tool can be implemented using any protocol for selection in small sets (recall that each C(l) has at most n elements), such as Rank; a single broadcast will notify all entities of the outcome and allow each to reduce its own set if needed. On the basis of this tool we can construct a reduction protocol that sequentially applies the Cutting Tool, first to C(2), then to C(4), then to C(8), and so on. Clearly, if at any time the search space becomes small (i.e., O(n)), we terminate. This reduction algorithm, which we will call CUT, is shown in Figure 5.6.

Let us examine the reduction power of Protocol CUT. After executing the Cutting Tool on C(2), only one column, C(1), might remain unchanged; all others, including C(2), will have at least half of the entries equal to +∞. In general, after the execution of the Cutting Tool on C(l), with l = 2^i, only the l − 1 columns C(1), C(2), ..., C(l − 1) might remain unchanged; all others, including C(l), will have at least n − ⌈K/l⌉ of the entries equal to +∞ (Exercise 5.6.20). This can be used to show the following (Exercise 5.6.21).

Lemma 5.2.6 After the execution of Protocol CUT, the number of items left under consideration is at most min{n, Δ} log Δ.

Each of the log Δ executions of the Cutting Tool performed by Protocol CUT requires a selection in a set of size at most min{n, Δ}. This can be performed using any of the protocols for selection in a small set, for example, Protocol Rank. In the worst case, it will require O(n²) messages in each iteration. This means that, in the worst case,

$$M[\text{CUT}] = O(n^2 \log \Delta), \tag{5.16}$$

$$T[\text{CUT}] = O(n \log \Delta). \tag{5.17}$$
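The repeated use of the Cutting Tool on the columns C(2), C(4), C(8), ... can be sketched on a centralized copy of the array D. This is a simplified illustration, not the protocol of Figure 5.6: the loop simply runs while l ≤ K (the stopping tests on log K and on the size of the search space are omitted), the selection in each column is done with a local sort instead of Protocol Rank, removed items are deleted rather than replaced by +∞, and the function name `cut` and the data are hypothetical.

```python
import math


def cut(rows, K):
    """Apply the Cutting Tool to C(2), C(4), C(8), ... and return the rows left.

    rows: one sorted list of distinct items per entity.
    """
    rows = [list(r) for r in rows]
    l = 2
    while l <= K:
        k = math.ceil(K / l)
        # C(l): the l-th smallest item of every row that still has one.
        column = sorted(r[l - 1] for r in rows if len(r) >= l)
        if len(column) >= k:
            m = column[k - 1]            # k-th smallest element of C(l)
            # m has at least k*l - 1 >= K - 1 smaller items, so nothing
            # larger than m can be the K-th smallest item.
            rows = [[d for d in r if d <= m] for r in rows]
        l *= 2
    return rows


# Hypothetical search space: n = 4 entities with 8 items each, K = 8.
rows = [list(range(1, 9)), list(range(9, 17)),
        list(range(17, 25)), list(range(25, 33))]
K = 8
reduced = cut(rows, K)
before = sorted(d for r in rows for d in r)
after = sorted(d for r in reduced for d in r)
print(len(before), len(after), before[K - 1] == after[K - 1])
```

The final comparison checks the property that makes the tool safe: the Kth smallest item of the original search space is still the Kth smallest item of the reduced one.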


ReduceSelect
begin
   REDUCE;
   if search space greater than O(Δ') then CUT
   if search space greater than O(n) then Filter
   Rank;
end

FIGURE 5.7: Protocol ReduceSelect.

Putting It All Together
We have examined a set of Reduction Tools. Summarizing, Protocol REDUCE, composed of the application of Reduction Tools 1 and 2, reduces the search space from N to at most Δ². Protocol CUT, composed of a sequence of applications of the Cutting Tool, reduces the search space from Δ² to at most min{n, Δ} log Δ. Starting from these reductions, to form a full selection protocol, we will first reduce the search space from min{n, Δ} log Δ to O(n) (e.g., using Protocol Filter) and then use a protocol for small sets (e.g., Rank) to determine the sought item. In other words, the resulting algorithm, Protocol ReduceSelect, will be as shown in Figure 5.7, where Δ' is the new value of Δ after the execution of REDUCE.

Let us examine the cost of Protocol ReduceSelect. Protocol REDUCE, as we have seen, requires at most 3 iterations of Local Contraction, each using 2(n − 1) messages and 2r(s) time, and one execution of Sites Reduction, which consists of an execution of Rank. Protocol CUT is used with N ≤ min{n, Δ}Δ and thus, as we have seen, requires at most log Δ iterations of the Cutting Tool, each consisting of an execution of Rank. Protocol Filter is used with N ≤ min{n, Δ} log Δ and thus, as we have seen, requires at most log log Δ iterations, each costing 2(n − 1) messages and 2r(s) time plus an execution of Rank. Thus, in total, we have

$$M[\text{ReduceSelect}] = (\log \Delta + 4.5 \log\log \Delta + 2)\, M[\text{Rank}] + (6 + 4.5 \log\log \Delta)(n - 1), \tag{5.18}$$

$$T[\text{ReduceSelect}] = (\log \Delta + 4.5 \log\log \Delta + 2)\, T[\text{Rank}] + (6 + 4.5 \log\log \Delta)\, 2r(s). \tag{5.19}$$

5.3 SORTING A DISTRIBUTED SET

5.3.1 Distributed Sorting

Sorting is perhaps the best known and most investigated algorithmic problem. In distributed computing systems, both the setting in which this problem takes place and its nature are very different from the serial and the parallel ones. In particular, in our setting, sorting must take place in networks of computing entities where no central controller is present and no common clock is available. Not surprisingly, most of the best serial and parallel sorting algorithms do very poorly when applied to a distributed environment. In this section we will examine the problem, its nature, and its solutions.

[Figure 5.8 shows four entities. In (a), x3 holds {11, 22, 30, 34, 45}, x1 holds {56, 57}, x2 holds {68, 69, 71, 75}, and x4 holds {82, 85, 87}. In (b), x2 holds {11, 22, 30, 34}, x4 holds {45, 56, 57}, x3 holds {68, 69, 71, 75, 82}, and x1 holds {85, 87}.]

FIGURE 5.8: Distribution sorted according to (a) π = 3124 and (b) π = 2431.

Let us start with a clear specification of the task and its requirements. As before in this chapter, we have a distribution D_{x1}, ..., D_{xn} of a set D among the entities x1, ..., xn of a system with communication topology G, where D_{xi} is the set of items stored at x_i. Each entity x_i, because of the Distinct Identifiers assumption ID, has a unique identity id(i) from a totally ordered set. For simplicity, in the following we will assume that the ids are the numbers 1, 2, ..., n and that id(i) = i, and we will denote D_{xi} simply by D_i.

Let us now focus on the definition of a sorted distribution. A distribution is (quite reasonably) considered sorted if, whenever i < j, all the data items stored at x_i are smaller than the items stored at x_j; this condition is usually called increasing order. A distribution is also considered sorted if all the smallest items are in x_n, the next ones in x_{n−1}, and so on, with the largest ones in x_1; usually, we call this condition decreasing order. Let us be precise. Let π be a permutation of the indices {1, ..., n}. A distribution D_1, ..., D_n is sorted according to π if and only if the following Sorting Condition holds:

$$\pi(i) < \pi(j) \;\Rightarrow\; \forall d \in D_i,\ d' \in D_j:\ d < d'. \tag{5.20}$$

In other words, if the distribution is sorted according to π, then all the smallest items must be in x_{π(1)}, the next smallest ones in x_{π(2)}, and so on, with the largest ones in x_{π(n)}. So the requirement that the data are sorted according to the increasing order of the ids of the entities is given by the permutation π = 1 2 ... n. The requirement of being sorted in decreasing order is given by the permutation π = n (n − 1) ... 1. For example, in Figure 5.8(b), the set is sorted according to the permutation π = 2 4 3 1; in fact, all the smallest data items are stored at x2, the next ones in x4, the yet larger ones in x3, and all the largest data items are stored at x1. We are now ready to define the problem of sorting a distributed set.
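A small check of the Sorting Condition may help to fix the convention. The sketch below follows the prose reading used in the example above: π lists the entities from the holder of the smallest items to the holder of the largest ones. The function name `is_sorted_according_to` is hypothetical, and the sets are those that can be read off Figure 5.8(b), which is an assumption about the figure's content.

```python
def is_sorted_according_to(dist, pi):
    """Check that entity pi[0] holds the smallest items, pi[1] the next, ...

    dist: dict mapping an entity index to its set of items.
    pi:   sequence of entity indices, from smallest holder to largest holder.
    """
    previous_max = None
    for entity in pi:
        items = dist[entity]
        if not items:                 # an empty set constrains nothing
            continue
        if previous_max is not None and min(items) <= previous_max:
            return False
        previous_max = max(items)
    return True


# Sets as read from Figure 5.8(b) (assumed assignment to x1..x4).
figure_b = {
    1: {85, 87},
    2: {11, 22, 30, 34},
    3: {68, 69, 71, 75, 82},
    4: {45, 56, 57},
}
print(is_sorted_according_to(figure_b, [2, 4, 3, 1]))
print(is_sorted_according_to(figure_b, [1, 2, 3, 4]))
```

The same distribution is therefore sorted with respect to one permutation and unsorted with respect to another, which is why π is part of the problem's input.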


Sorting Problem
Given a distribution D_1, ..., D_n of D and a permutation π, the distributed sorting problem is the one of moving data items among the entities so that, upon termination,
1. D'_1, ..., D'_n is a distribution of D, where D'_i is the final set of data at x_i;
2. D'_1, ..., D'_n is sorted according to π.

Note that the definition does not say anything about the relationship between the sizes of the initial sets D_i and those of the final sets D'_i. Depending on which requirement we impose, we have different versions of the problem. There are three fundamental requirements:

invariant-sized sorting: |D'_i| = |D_i|, 1 ≤ i ≤ n, that is, each entity ends up with the same number of items it started with.

equidistributed sorting: |D'_{π(i)}| = ⌈N/n⌉ for 1 ≤ i < n and |D'_{π(n)}| = N − (n − 1)⌈N/n⌉, that is, every entity receives the same amount of data, except for x_{π(n)}, which might receive fewer items.

compacted sorting: |D'_{π(i)}| = min{w, N − (i − 1)w}, where w ≥ ⌈N/n⌉ is the storage capacity of the entities, that is, each entity, starting from x_{π(1)}, receives as many unassigned items as it can store.

Notice that equidistributed sorting is a compacted sorting with w = ⌈N/n⌉. For some of the algorithms we will discuss, it does not really matter which requirement is used; for some protocols, however, the choice of the requirement is important. In the following, unless otherwise specified, we will use the invariant-sized requirement.

From the definition, it follows that when sorting a distributed set the relevant factors are the permutation according to which we sort, the topology of the network in which we sort, the location of the entities in the network, as well as the storage requirements. In the following two sections, we will examine some special cases that will help us understand these factors, their interplay, and their impact.
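The three storage requirements differ only in the sizes of the final sets, so they can be compared by computing those sizes. In the sketch below the function name `target_sizes` and the sample sizes are hypothetical; sizes are listed in the order x_π(1), ..., x_π(n); and the clamping of negative values to 0 (for entities that are left with nothing once all items have been assigned) is an added assumption, not part of the definitions.

```python
import math


def target_sizes(initial_sizes, requirement, w=None):
    """Final set sizes, listed for x_pi(1), ..., x_pi(n)."""
    n = len(initial_sizes)
    N = sum(initial_sizes)
    if requirement == "invariant":
        return list(initial_sizes)
    if requirement == "equidistributed":
        w = math.ceil(N / n)          # compacted sorting with w = ceil(N/n)
    elif requirement != "compacted":
        raise ValueError("unknown requirement: " + requirement)
    if w is None or w < math.ceil(N / n):
        raise ValueError("the capacity w must be at least ceil(N/n)")
    # min{w, N - (i - 1) w} for i = 1..n, clamped at 0 (assumption).
    return [max(0, min(w, N - i * w)) for i in range(n)]


sizes = [4, 4, 3, 3, 3]               # hypothetical initial sizes, N = 17
print(target_sizes(sizes, "invariant"))
print(target_sizes(sizes, "equidistributed"))
print(target_sizes(sizes, "compacted", w=6))
```

In every case the sizes add up to N; only the way the items are spread over the entities changes.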
5.3.2 Special Case: Sorting on an Ordered Line

Consider the case when we want to sort the data according to a permutation π, and the network G is a line where x_{π(i)} is connected to x_{π(i+1)}, 1 ≤ i < n. This case is very special. In fact, the entities are located on the line in such a way that their indices are ordered according to the permutation π. (The data, however, are not sorted.) For this reason, G is also called an ordered line. As an example, see Figure 5.9, where π = 1, 2, ..., n.

A simple sorting technique for an ordered line is OddEven-LineSort, based on the parallel algorithm odd-even-transposition sort, which is in turn based on the well-known serial algorithm Bubble Sort. This technique is composed of a sequence of iterations, where initially j = 0.


[Figure 5.9 shows the line x1, x2, x3, x4, x5 with the initial sets x1: {1, 9, 13, 18}, x2: {3, 6, 8, 20}, x3: {2, 7, 12}, x4: {10, 15, 16}, x5: {5, 11, 14}.]

FIGURE 5.9: A distribution on an ordered line of size n = 5.

Technique OddEven-LineSort:
1. In iteration 2j + 1 (an odd iteration), entity x_{2i+1} exchanges its data with neighbour x_{2i+2}, 0 ≤ i ≤ ⌊n/2⌋ − 1; as a result, x_{2i+1} retains the smallest items while x_{2i+2} retains the largest ones.
2. In iteration 2j (an even iteration), entity x_{2i} exchanges its data with neighbour x_{2i+1}, 1 ≤ i ≤ ⌈n/2⌉ − 1; as a result, x_{2i} retains the smallest items while x_{2i+1} retains the largest ones.
3. If no data items change place at all during an iteration (other than the first), then the process stops.

A schematic representation of the operations performed by the technique OddEven-LineSort is given by the "sorting diagram": a synchronous TED (time-event diagram) where the exchange of data between two neighboring entities is shown as a bold line connecting the time lines of the two entities. The sorting diagram for a line of n = 5 entities is shown in Figure 5.10. The alternation of "odd" and "even" steps is clearly visible in the diagram.

To obtain a fully specified protocol, we still need to explain two important operations: termination and data exchange.

Termination. We have said that we terminate when no data items change place at all during an iteration. This situation can be easily determined. In fact, at the end of an iteration, each entity x can set a Boolean variable change to true or false to indicate whether or not its data set has changed during that iteration. Then, we can check (by computing the AND of those variables) whether no data items have changed place at all during that iteration; if this is the case for every entity, we terminate, else we start the next iteration.

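[Sorting diagram: the time lines of x1, ..., x5, with bold lines joining the neighbors that exchange data in the alternating odd and even steps.]

FIGURE 5.10: Diagram of operations of OddEven-LineSort in a line of size n = 5.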


Data Exchange. At the basis of the technique there is the exchange of data between two neighbors, say x and y; at the end of this exchange, which we will call merge, x will have the smallest items and y the largest ones (or vice versa). This specification is, however, not quite precise. Assume that, before the merge, x has p items while y has q items, where possibly p ≠ q; how much data should x and y retain after the merge? The answer depends, partially, on the storage requirements. If we are to perform an invariant-sized sorting, x should retain p items and y should retain q items. If we are to perform a compacted sorting, x should retain min{w, p + q} items and y retain the others. If we are to perform an equidistributed sorting, x should retain min{⌈N/n⌉, p + q} items and y retain the others. Notice that, in this case, each entity needs to know both n and N.

The result of the execution of OddEven-LineSort with the invariant-sized requirement on the ordered line of Figure 5.9 is shown in Table 5.2. The correctness of the protocol, although intuitive, is not immediate (Exercises 5.6.23, 5.6.24, 5.6.25, and 5.6.26). In particular, the so-called "0-1 principle" (employed to prove the correctness of the similar parallel algorithm) cannot be used directly in our case. This is due to the fact that the local data sets D_i may contain several items, and may have different sizes.

Cost
The time cost is clearly determined by the number of iterations. In the worst case, the data items are initially sorted the "wrong" way; that is, the initial distribution is sorted according to the permutation π' = π(n), π(n − 1), ..., π(1). Consider the largest item; it has to move from x1 to xn; as it can only move by one location per iteration, it requires n − 1 iterations to complete its move. Indeed, this is the actual cost for some initial distributions (Exercise 5.6.27).
Property 5.3.1 OddEven-LineSort sorts an equidistributed distribution in n − 1 iterations if the required sorting is (a) invariant-sized, or (b) equidistributed, or (c) compacted.

TABLE 5.2: Execution of OddEven-LineSort on the System of Figure 5.9

iteration   x1              x2               x3              x4              x5
1           {1,9,13,18} →   ← {3,6,8,20}     {2,7,12} →      ← {10,15,16}    {5,11,14}
2           {1,3,6,8}       {9,13,18,20} →   ← {2,7,10}      {12,15,16} →    ← {5,11,14}
3           {1,3,6,8} →     ← {2,7,9,10}     {13,18,20} →    ← {5,11,12}     {14,15,16}
4           {1,2,3,6}       {7,8,9,10} →     ← {5,11,12}     {13,18,20} →    ← {14,15,16}
5           {1,2,3,6} →     ← {5,7,8,9}      {10,11,12} →    ← {13,14,15}    {16,18,20}
6           {1,2,3,5}       {6,7,8,9} →      ← {10,11,12}    {13,14,15} →    ← {16,18,20}
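The execution traced in Table 5.2 can be reproduced with a short synchronous simulation of the technique under the invariant-sized requirement. This is a sketch, not the distributed protocol: the merge is done on pooled local lists, the termination test replaces the distributed computation on the change variables, and the function names `merge` and `odd_even_line_sort` are illustrative.

```python
def merge(left, right):
    """Invariant-sized merge: left keeps its size and the smallest items."""
    pooled = sorted(left + right)
    return pooled[:len(left)], pooled[len(left):]


def odd_even_line_sort(dist):
    """dist[0] is the set of x1, dist[1] of x2, ... on an ordered line.

    Returns the final distribution and the number of iterations executed,
    counting the last iteration in which nothing changes.
    """
    data = [sorted(d) for d in dist]
    n = len(data)
    iteration = 0
    while True:
        iteration += 1
        # Odd iterations pair (x1,x2), (x3,x4), ...; even ones (x2,x3), ...
        start = 0 if iteration % 2 == 1 else 1
        changed = False
        for i in range(start, n - 1, 2):
            new_left, new_right = merge(data[i], data[i + 1])
            if new_left != data[i] or new_right != data[i + 1]:
                changed = True
            data[i], data[i + 1] = new_left, new_right
        if not changed and iteration > 1:
            return data, iteration


# Initial distribution of Figure 5.9 (first row of Table 5.2).
initial = [[1, 9, 13, 18], [3, 6, 8, 20], [2, 7, 12], [10, 15, 16], [5, 11, 14]]
final, iterations = odd_even_line_sort(initial)
print(iterations, final)
```

The simulation stops at the first iteration after the first in which no item changes place, which is the stopping rule stated in the technique.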


Interestingly, the number of iterations can actually be much more than n − 1 if the initial distribution is not equidistributed. Consider, for example, an invariant-sized sorting when the initial distribution is sorted according to the permutation π' = π(n), π(n − 1), ..., π(1). Assume that x1 and xn each have kq items, while x2 has only q items. All the items initially stored in x1 must end up in xn; however, in the first iteration only q items will move from x1 to x2; because of the "odd-even" alternation, the next q items will leave x1 in the 3rd iteration, the next q in the 5th, and so on. Hence, the total number of iterations required for all data to move from x1 to xn is at least n − 1 + 2(k − 1). This implies that, in the worst case, the time costs can be considerably high (Exercise 5.6.28):

Property 5.3.2 OddEven-LineSort performs an invariant-sized sorting in at most N − 1 iterations. This number of iterations is achievable.

Assuming (quite unrealistically) that the entire data set of an entity can be sent in one time unit to its neighbor, the time required by all the merge operations is exactly the same as the number of iterations. In contrast to this, to determine termination, we need to compute the AND of the Boolean variables change at each iteration. This operation can be done on a line in time n − 1 at each iteration. Thus, in the worst case,

$$T[\text{OddEven-LineSort}_{\text{invariant}}] = O(nN). \tag{5.21}$$

Similarly bad time costs can be derived for equidistributed sorting and compacted sorting.

Let us focus now on the number of messages for invariant-sized sorting. If we do not impose any size constraints on the initial distribution then, by Property 5.3.2, the number of iterations can be as bad as N − 1; as in each iteration we perform the computation of the function AND, and this requires 2(n − 1) messages, it follows that the protocol will use 2(n − 1)(N − 1) messages just for computing the AND. To this cost we still need to add the number of messages used for the transfer of data items. Hence, without storage constraints on the initial distribution, the protocol has a very high cost due to the high number of iterations possible.

Let us consider now the case when the initial distribution is equidistributed. By Property 5.3.1, the number of iterations is at most n − 1 (instead of N − 1). This means that the cost of computing the AND is O(n²) (instead of O(Nn)). Surprisingly, even in this case, the total number of messages can be very high.

Property 5.3.3 OddEven-LineSort can use O(Nn) messages to perform an invariant-sized sorting. This cost is achievable even if the data is initially equidistributed.


To see why this is the case, consider an initial equidistribution sorted according to the permutation π' = π(n), π(n − 1), ..., π(1). In this case, every data item will change location in each iteration (Exercise 5.6.29), that is, O(N) messages will be sent in each iteration. As there can be n − 1 iterations with an initial equidistribution (by Property 5.3.1), we obtain the bound. Summarizing:

$$M[\text{OddEven-LineSort}_{\text{invariant}}] = O(nN). \tag{5.22}$$

That is, using Protocol OddEven-LineSort can cost as much as broadcasting all the data to every entity. This result holds even if the data is initially equidistributed. Similarly bad message costs can be derived for equidistributed sorting and compacted sorting. Summarizing, Protocol OddEven-LineSort does not appear to be very efficient.

IMPORTANT. Each line network is ordered according to a permutation. However, this permutation might not be π, according to which we need to sort the data. What happens in this case? The protocol OddEven-LineSort does not work if the entities are not positioned on the line according to π, that is, when the line is not ordered according to π (Exercise 5.6.30). The question then becomes how to sort a set distributed on an unsorted line. We will leave this question open until later in this chapter.

5.3.3 Removing the Topological Constraints: Complete Graph

One of the problems we have faced in the line graph is the constraint that the topology of the network imposes. Indeed, the line graph is one of the worst topologies for a tree, as its diameter is n − 1. In this section we will do the opposite: We will consider the complete graph, where every entity is directly connected to every other entity; in this way, we will be able to remove the constraints imposed by the network topology. Without loss of generality (since we are in a complete network), we assume π = 1, 2, ..., n.

As the complete graph contains every graph as a subgraph, we can choose to operate on whichever graph best suits our computational needs. Thus, for example, we can choose an ordered line and use the protocol OddEven-LineSort we discussed before. However, as we have seen, this protocol is not very efficient. If we are in a complete graph, we can adapt and use some of the well-known techniques for serial sorting. Let us focus on the classical Merge-Sort strategy.
This strategy, in our distributed setting, becomes as follows: (1) the distribution to be sorted is first divided into two partial distributions of equal size; (2) each of these two partial distributions is independently sorted recursively using MergeSort; and (3) the two sorted partial distributions are then merged to form a sorted distribution. The problem with this strategy is that the last step, the merging step, is not an obvious one in a distributed setting; in fact, after the first iteration, the two sorted distributions to be merged are scattered among many entities. Hence the question: How do we efficiently "merge" two sorted distributions of several sets to form a sorted distribution? There are many possible answers, each yielding a different merge-sort protocol. In the following we discuss a protocol for performing distributed merging by means of the odd-even strategy we discussed for the ordered line.

Let us first introduce some terminology. We are given a distribution D = ⟨D_1, ..., D_n⟩. Consider now a subset {D_{j_1}, ..., D_{j_q}} of the data sets, where j_i < j_{i+1} (1 ≤ i < q). The corresponding distribution D' = ⟨D_{j_1}, ..., D_{j_q}⟩ is called a partial distribution of D. We say that the partial distribution D' is sorted (according to π = 1, ..., n) if all the items in D_{j_i} are smaller than the items in D_{j_{i+1}}, 1 ≤ i < q. Note that it might happen that D' is sorted while D is not.

Let us now describe how to odd-even-merge a sorted partial distribution ⟨A_1, ..., A_{p/2}⟩ with a sorted partial distribution ⟨A_{p/2+1}, ..., A_p⟩ to form a sorted distribution ⟨A_1, ..., A_p⟩, where we are assuming for simplicity that p is a power of 2.

OddEven-Merge Technique:
1. If p = 2, then there are two sets A_1 and A_2, held by entities y_1 and y_2, respectively. To odd-even-merge them, each of y_1 and y_2 sends its data to the other entity; y_1 retains the smallest items while y_2 retains the largest items. We call this basic operation simply merge.
2. If p > 2, then the odd-even-merge is performed as follows:
   (a) first recursively odd-even-merge the distribution ⟨A_1, A_3, A_5, ..., A_{p/2−1}⟩ with the distribution ⟨A_{p/2+1}, A_{p/2+3}, A_{p/2+5}, ..., A_{p−1}⟩;
   (b) then recursively odd-even-merge the distribution ⟨A_2, A_4, A_6, ..., A_{p/2}⟩ with the distribution ⟨A_{p/2+2}, A_{p/2+4}, A_{p/2+6}, ..., A_p⟩;
   (c) finally, merge A_{2i} with A_{2i+1} (1 ≤ i ≤ p/2 − 1).

The technique OddEven-Merge can then be used to generate the OddEven-MergeSort technique for sorting a distribution D_1, ..., D_n. As in the classical case, the technique is defined recursively as follows:

OddEven-MergeSort Technique:
1. recursively odd-even-merge-sort the distribution D_1, ..., D_{n/2};
2. recursively odd-even-merge-sort the distribution D_{n/2+1}, ..., D_n;
3. odd-even-merge D_1, ..., D_{n/2} with D_{n/2+1}, ..., D_n.

Using this technique, we obtain a protocol for sorting a distribution D_1, ..., D_n; we shall call this protocol like the technique itself: Protocol OddEven-MergeSort. To determine the communication costs of this protocol we need to "unravel" the recursion.


[Sorting diagram: the time lines of x1, ..., x8, with bold lines joining the pairs of entities that perform a merge.]

FIGURE 5.11: Diagram of operations of OddEven-MergeSort with n = 8.

When we do this, we realize that the protocol is a sequence of 1 + log n iterations (Exercise 5.6.32). In each iteration (except the last) every entity is paired with another entity, and each pair will perform a simple merge of their local sets; half of the entities will perform this operation twice during an iteration. In the last iteration all entities, except x1 and xn, will be paired and perform a merge.

Example
Using the sorting diagram to describe these operations, the structure of an execution of Protocol OddEven-MergeSort when n = 8 is shown in Figure 5.11. Notice that there are 4 iterations; observe that, in iteration 2, merge will be performed between the pairs (x1, x3), (x2, x4), (x5, x7), (x6, x8); observe further that entities x2, x3, x6, x7 will each be involved in one more merge in this same iteration.

Summarizing, in each of the first log n iterations, each entity sends its data to one or two other entities. In other words, the entire distributed set is transmitted in each iteration. Hence, the total number of messages used by Protocol OddEven-MergeSort is

$$M[\text{OddEven-MergeSort}] = O(N \log n). \tag{5.23}$$

Note that this bound holds regardless of the storage requirement.

IMPORTANT. Does the protocol work? Does it in fact sort the data? The answer to these questions is: not always. In fact, its correctness depends on several factors, including the storage requirements. It is not difficult to prove that the protocol correctly sorts, regardless of the storage requirement, if the initial set is equidistributed (Exercise 5.6.33).
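The recursive structure of OddEven-Merge and OddEven-MergeSort can be sketched as a sequential simulation with invariant-sized merges. The function names are illustrative, n is assumed to be a power of 2, and entities are indexed from 0 (so index 0 is x1). The first data set is a hypothetical equidistributed one; the second uses the initial sets of the four-entity example of Figure 5.12, as read from that figure.

```python
def merge(dist, a, b):
    """Invariant-sized merge: entity a keeps its size and the smallest items."""
    pooled = sorted(dist[a] + dist[b])
    size_a = len(dist[a])
    dist[a], dist[b] = pooled[:size_a], pooled[size_a:]


def odd_even_merge(dist, idx):
    """Odd-even-merge the sorted halves of the partial distribution idx."""
    p = len(idx)
    if p == 2:
        merge(dist, idx[0], idx[1])
        return
    odd_even_merge(dist, idx[0::2])      # A_1, A_3, ..., A_{p-1}
    odd_even_merge(dist, idx[1::2])      # A_2, A_4, ..., A_p
    for i in range(1, p - 1, 2):         # merge A_{2i} with A_{2i+1}
        merge(dist, idx[i], idx[i + 1])


def odd_even_merge_sort(dist, idx=None):
    """Sort dist in place; len(dist) is assumed to be a power of 2."""
    if idx is None:
        idx = list(range(len(dist)))
    if len(idx) < 2:
        return
    half = len(idx) // 2
    odd_even_merge_sort(dist, idx[:half])
    odd_even_merge_sort(dist, idx[half:])
    odd_even_merge(dist, idx)


# Hypothetical equidistributed input: 8 entities, 2 items each, reversed.
equi = [[16, 15], [14, 13], [12, 11], [10, 9], [8, 7], [6, 5], [4, 3], [2, 1]]
odd_even_merge_sort(equi)
print(equi)

# Initial sets of Figure 5.12: not equidistributed.
skewed = [[4, 8], [6], [7], [1, 3]]
odd_even_merge_sort(skewed)
print(skewed)
```

On the equidistributed input the final distribution is sorted; on the second input the sizes are preserved but the final distribution is not sorted, which illustrates why the equidistribution hypothesis matters.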


        initial      after 1st merges   after 2nd merges   final
x1      {4, 8}       {4, 6}             {1, 4}             {1, 4}
x2      {6}          {8}                {3}                {3}
x3      {7}          {1}                {6}                {6}
x4      {1, 3}       {3, 7}             {7, 8}             {7, 8}

FIGURE 5.12: OddEven-MergeSort does not correctly perform an invariant sort for this distribution.

Property 5.3.4 OddEven-MergeSort sorts any equidistributed set if the required sorting is (a) invariant-sized, (b) equidistributed, or (c) compacted.

However, if the initial set is not equidistributed, the distribution obtained when the protocol terminates might not be sorted. To understand why, consider performing an invariant-sized sorting in the system of n = 4 entities shown in Figure 5.12; items 1 and 3, initially at entity x4, should end up in entity x1, but item 3 is at x2 when the protocol terminates. The reason for this happening is the "bottleneck" created by the fact that only one item at a time can be moved to each of x2 and x3. Recall that the existence of bottlenecks was the reason for the high number of iterations of Protocol OddEven-LineSort. In this case, the problem makes the protocol incorrect. It is indeed possible to modify the protocol, adding enough appropriate iterations, so that the distribution will be correctly sorted. The type and the number of the additional iterations needed to correct the protocol depend on many factors. In the example shown in Figure 5.12, a single iteration consisting of a simple merge between x1 and x2 would suffice. In general, the additional requirements depend on the specifics of the size of the initial sets; see, for example, Exercise 5.6.34.

5.3.4 Basic Limitations

In the previous sections we have seen different protocols, examined their behavior, and analyzed their costs. In this process we have seen that the amount of data items transmitted can be very large. For example, in OddEven-LineSort the number of messages is O(Nn), the same as sending every item everywhere. Even not worrying about the limitations imposed by the topology of the network, protocol OddEven-MergeSort still uses O(N log n) messages when it works correctly. Before proceeding any further, we are going to ask the following question: How many messages need to be sent anyway?
We would like the answer to be independent of the protocol but to take into account both the topology of the network and the storage requirements. The purpose of this section is to provide such an answer, to use it to assess the solutions seen so far, and to understand its implications. On the basis of this, we will be able to design an efficient sorting protocol.

Lower Bound
There is a minimum necessary amount of data movement that must take place when sorting a distributed set. Let us determine exactly what costs must be incurred regardless of the algorithm we employ.


The basic observation we employ is that, once we are given a permutation π according to which we must sort the data, there are some inescapable costs. In fact, if entity x has some data that according to π must end up in y, then this data must move from x to y, regardless of the sorting algorithm we use. Let us state these concepts more precisely.

Given a network G, a distribution $D = \langle D_1, \ldots, D_n \rangle$ of $\mathcal{D}$ on G, and a permutation π, let $D' = \langle D'_1, \ldots, D'_n \rangle$ be the result of sorting D according to π. Then $|D_i \cap D'_j|$ items must travel from $x_i$ to $x_j$; this means that the amount of data transmission for this transfer is at least $|D_i \cap D'_j| \, d_G(x_i, x_j)$. How this amount translates into number of messages depends on the size of the messages. A message can only contain a (small) constant number of data items; to obtain a uniform measure, we consider just one data item per message. Then

Theorem 5.3.1 The number of messages required to sort D according to π in G is at least

$$C(D, G, \pi) = \sum_{i \neq j} |D_i \cap D'_j| \, d_G(x_i, x_j).$$
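As a rough illustration of Theorem 5.3.1, the following Python sketch evaluates the bound on a tiny instance. The function name `necessary_cost`, the list-of-sets representation, and the three-entity line are assumptions made only for this example; they do not come from the text.

```python
from itertools import product

def necessary_cost(initial, final, dist):
    """Sum over i != j of |D_i ∩ D'_j| * d_G(x_i, x_j)."""
    n = len(initial)
    total = 0
    for i, j in product(range(n), repeat=2):
        if i != j:
            total += len(initial[i] & final[j]) * dist[i][j]
    return total

# Hypothetical ordered line x1 - x2 - x3, so d_G(x_i, x_j) = |i - j|.
dist = [[abs(i - j) for j in range(3)] for i in range(3)]
initial = [{5, 6}, {3, 4}, {1, 2}]   # the distribution D
final = [{1, 2}, {3, 4}, {5, 6}]     # D' for the permutation <1, 2, 3>

# Items 5, 6 and 1, 2 each cross two links: 2*2 + 2*2 = 8 messages at least.
print(necessary_cost(initial, final, dist))
```

Calling the function with the same distribution as both arguments returns 0, which matches the observation that an already sorted distribution incurs no necessary cost.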

This expresses a lower bound on the number of messages for distributed sorting; the actual value depends on the topology G and the storage requirements. The determination of this value in specific topologies for different storage requirements is the subject of Exercises 5.6.35–5.6.38.

Assessing Previous Solutions Let us see what this bound means for situations we have already examined. In this bound, the topology of the network plays a role through the distances $d_G(x_i, x_j)$ between the entities that must transfer data, while the storage requirements play a role through the sizes $|D'_i|$ of the resulting sets. First of all, note that, by definition, for all $x_i, x_j$, we have $d_G(x_i, x_j) \le d(G)$; furthermore,

$$\sum_{i \neq j} |D_i \cap D'_j| \le N. \tag{5.24}$$

To derive lower bounds on the number of messages for a speciﬁc network G, we need to consider for that network the worst possible allocation of the data, that is, the one that maximizes C(D, G, π ). Ordered Line. OddEven-LineSort Let us focus ﬁrst on the ordered line network.


If the data is not initially equidistributed, it is easy to show scenarios where O(N) data must travel an O(n) distance along the line. For example, consider the case when $x_n$ initially contains the smallest N − n + 1 items while all other entities have just a single item each; for simplicity, assume (N − n + 1)/n to be integer. Then for equidistributed sorting we have $|D_n \cap D'_j| = (N − n + 1)/n$ for j < n; this means that at least j n, for example, when N ≥ n² log n. In contrast, protocol OddEven-MergeSort always has a worst-case cost of O(N log n), and it might even not sort. The determination of the cost of protocol SelectSort in specific topologies for different storage requirements is the subject of Exercises 5.6.41–5.6.48.

5.3.6 Unrestricted Sorting

In the previous section we have examined the problem of sorting a distributed set according to a given permutation. This describes the common occurrence when there is some a priori ordering of the entities (e.g., of their ids), according to which the data must be sorted. There are, however, occurrences where the interest is to sort the data with no a priori restriction on what ordering of the sites should be used. In other words, in these cases, the goal is to sort the data according to some permutation. This version of the problem is called unrestricted sorting.

Solving the unrestricted sorting problem means that we, as designers, have the choice of the permutation according to which we will sort the data. Let us examine the impact of this choice in some detail. We have seen that, for a given permutation π, once the storage requirement is fixed, there is an amount of message exchanges that must necessarily be performed to transfer the records to their destinations; this amount is expressed by Theorem 5.3.1. Observe that this necessary cost is smaller for some permutations than for others. For example, assume that the data is initially equidistributed sorted according to π1 = ⟨1, 2, . . . , n⟩, where n is even.
Obviously, there is no cost for an equidistributed sorting of the set according to π1 , as the data is already in the proper place. By contrast, if we need to sort the distribution according to π2 = n, n − 1, . . . , 2, 1, then, even with the same storage requirement as before, the operation will be very costly: At least N messages must be sent, as every data item must necessarily move.


Thus, it is reasonable to ask that the entities choose the permutation π that minimizes the necessary cost for the given storage requirement. For this task, we express the storage requirements as a tuple $k = \langle k_1, k_2, \ldots, k_n \rangle$ where $k_j \le w$ and $\sum_{1 \le j \le n} k_j = N$: the sizes of the sets of the sorted distribution $D'$ must be such that $|D'_{\pi(j)}| = k_j$. Notice that this generalized storage requirement includes both the compacted (i.e., $k_j = w$) and equidistributed (i.e., $k_j = N/n$) ones, but not necessarily the identical requirement.

More precisely, the task we are facing, called dynamic sorting, is the following: given the distribution D and a requirement tuple $k = \langle k_1, k_2, \ldots, k_n \rangle$, we need to determine the permutation π such that, ∀π′,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} |D_i \cap D'_j(\pi)| \, d_G(x_i, x_j) \;\le\; \sum_{i=1}^{n} \sum_{j=1}^{n} |D_i \cap D'_j(\pi')| \, d_G(x_i, x_j) \tag{5.27}$$

where $D'(\pi) = \langle D'_1(\pi), D'_2(\pi), \ldots, D'_n(\pi) \rangle$ is the resulting distribution sorted according to π. To determine π we must solve an optimization problem. Many optimization problems, although solvable, are computationally expensive as they are NP-hard. Surprisingly, and fortunately, our problem is not. Notice that there might be more than one permutation achieving such a goal; in this case, we just choose one (e.g., the lexicographically smallest).

To determine π we need to minimize the necessary cost over all possible permutations π′. Fortunately, we can do it without having to determine each $D'(\pi')$. In fact, regardless of which permutation we eventually determine to be π, because of the storage requirements we know that $k_j = |D'_{\pi(j)}|$ data items must end up in $x_{\pi(j)}$, 1 ≤ j ≤ n. Hence, we can determine which items of $x_i$ must be sent to $x_{\pi(j)}$ even without knowing π. In fact, let $b_j = \mathcal{D}[\sum_{l \le j} k_l]$ be the $(k_1 + \cdots + k_j)$th smallest item overall; then all the items d with $b_{j-1} < d \le b_j$ must be sent to $x_{\pi(j)}$. In other words,

$$D_{i,\pi(j)} = D_i \cap D'_{\pi(j)} = \{ d \in D_i : b_{j-1} < d \le b_j \}.$$
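The splitting of a local set by the thresholds $b_j$ can be sketched as follows. This is only an illustration: the thresholds are computed here by a centralized sort standing in for the distributed selection protocol, every $k_j$ is assumed to be at least 1, and the names `thresholds` and `split_local_set` are hypothetical.

```python
def thresholds(all_items, k):
    """b_j = the (k_1 + ... + k_j)-th smallest item overall, j = 1..n.

    Centralized stand-in for distributed selection; assumes every k_j >= 1.
    """
    ranked = sorted(all_items)
    bounds, upto = [], 0
    for kj in k:
        upto += kj
        bounds.append(ranked[upto - 1])
    return bounds

def split_local_set(d_i, bounds):
    """Return the blocks {d in D_i : b_{j-1} < d <= b_j} for j = 1..n."""
    parts, lower = [], float("-inf")
    for b in bounds:
        parts.append({d for d in d_i if lower < d <= b})
        lower = b
    return parts

# Hypothetical instance: three entities, equidistributed requirement k = <3, 3, 3>.
local_sets = [{9, 2, 7}, {4, 1}, {8, 3, 5, 6}]
bounds = thresholds(set().union(*local_sets), (3, 3, 3))
for d_i in local_sets:
    print(split_local_set(d_i, bounds))
```

Each entity needs only the n threshold values to decide, item by item, which block an item belongs to; which entity finally stores each block is a separate decision.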

This means that we can use the same technique as before: the entities collectively determine the items $b_1, b_2, \ldots, b_n$ employing a distributed selection protocol; then each entity $x_i$ uses these values to determine which of its own data items must be sent to $x_{\pi(j)}$. To be able to complete the task, we do need to know which entity is $x_{\pi(j)}$, that is, we need to determine π. To this end, observe that we can rewrite expression 5.27 as: ∀π′,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} |D_{i,\pi(j)}| \, d_G(x_i, x_{\pi(j)}) \;\le\; \sum_{i=1}^{n} \sum_{j=1}^{n} |D_{i,\pi'(j)}| \, d_G(x_i, x_{\pi'(j)}). \tag{5.28}$$
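Expression 5.28 can be read as an assignment problem: placing block j at entity y costs $c(j, y) = \sum_i |D_{i,j}| \, d_G(x_i, y)$, and π must pick one distinct entity per block. The sketch below makes this concrete with a brute-force search over all permutations, which is only practical for very small n; it is not the polynomial method asked for in Exercise 5.6.49, and the names `block_cost` and `best_permutation` are hypothetical.

```python
from itertools import permutations

def block_cost(sizes, dist):
    """c[j][y] = sum_i sizes[i][j] * dist[i][y]: cost of storing block j at y."""
    n = len(sizes)
    return [[sum(sizes[i][j] * dist[i][y] for i in range(n)) for y in range(n)]
            for j in range(n)]

def best_permutation(sizes, dist):
    """Permutation pi (block j goes to entity pi[j]) minimizing Expression 5.28.

    Brute force over n! candidates; ties go to the lexicographically smallest.
    """
    c = block_cost(sizes, dist)
    n = len(c)
    return min(permutations(range(n)),
               key=lambda pi: sum(c[j][pi[j]] for j in range(n)))

# Hypothetical line x1 - x2 - x3 whose data is already sorted in reverse order:
# sizes[i][j] is the number of items of block j currently held by entity i.
line = [[abs(i - y) for y in range(3)] for i in range(3)]
sizes = [[0, 0, 2], [0, 2, 0], [2, 0, 0]]
print(best_permutation(sizes, line))
```

On this instance the search returns the reverse permutation, for which no item has to move.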


Strategy DynamicSelectSort
begin
   for j = 1, . . . , n − 1 do
      Collectively determine b_j = D[k_1 + . . . + k_j] using distributed selection;
      D_{i,j} := {d ∈ D_i : b_{j−1} < d ≤ b_j};
      n_i(j) := |D_{i,j}|;
   endfor
   D_{i,n} := {d ∈ D_i : b_{n−1} < d};
   n_i(n) := |D_{i,n}|;
   if x_i ≠ x̄ then
      send n_i(1), . . . , n_i(n) to x̄;
   else
      wait until receive information from all entities;
      determine π and notify all entities;
   endif
   send D_{i,j} to x_{π(j)}, 1 ≤ j ≤ n;
end

FIGURE 5.14: Strategy DynamicSelectSort.

Using this fact, π can be determined in low polynomial time once we know the sizes $|D_{i,\pi(j)}|$ as well as the distances $d_G(x, y)$ between all pairs of entities (Exercise 5.6.49). Therefore, our overall solution strategy is the following: first, each entity $x_i$ determines the local sets $D_{i,j}$ using distributed selection; then, using information about the sizes $|D_{i,j}|$ of those sets and the distances $d_G(x, y)$ between entities, a single entity $\bar{x}$ determines the permutation π that minimizes Expression 5.28; finally, once π is made known, each entity sends its data to their final destinations. A high-level description is shown in Figure 5.14. Missing from this description is the collection at the coordinator $\bar{x}$ of the distance information; this can be achieved simply by having each entity x send to $\bar{x}$ the distances from its neighbors N(x). Once all details have been specified, the resulting Protocol DynamicSelectSorting makes it possible to sort a distribution according to the permutation, unknown a priori, that minimizes the necessary costs. See Exercise 5.6.50.

The additional costs of the protocol are not difficult to determine. In fact, Protocol DynamicSelectSorting is exactly the same as Protocol SelectSort with two additional operations: (1) the collection at $\bar{x}$ of the distance and size information, and (2) the notification by $\bar{x}$ of the permutation π. The first operation requires $|N(x_i)| + n$ items of information to be sent by each entity $x_i$ to $\bar{x}$: the $|N(x_i)|$ distances from its neighbors and the n sizes $|D_{i,\pi(j)}|$. The second operation consists of sending π, which is composed of n items of information. Hence, the cost incurred by Protocol DynamicSelectSorting in addition to that of Protocol SelectSort is

$$\sum_{x} \left( |N(x)| + 2n \right) d_G(x, \bar{x}). \tag{5.29}$$
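To get a feeling for the magnitude of Expression 5.29, the sketch below evaluates it on a small network given as an adjacency list, using breadth-first search for the hop distances to the coordinator. The graph, the entity names, and the function names `bfs_distances` and `additional_cost` are assumptions made for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances d_G(source, x) for every x, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def additional_cost(adj, coordinator):
    """Expression 5.29: sum over x of (|N(x)| + 2n) * d_G(x, coordinator)."""
    n = len(adj)
    d = bfs_distances(adj, coordinator)
    return sum((len(adj[x]) + 2 * n) * d[x] for x in adj)

# Hypothetical line network a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(additional_cost(adj, "a"))  # coordinator at one end of the line
print(additional_cost(adj, "b"))  # coordinator closer to the middle
```

The value depends only on the topology and on the position of the coordinator, not on the number N of data items.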


Notice that this cost does not depend on the size N of the distributed set, and it is less than the total additional costs of Protocol SelectSort. This means that, with twice the additional cost of Protocol SelectSort, we can sort minimizing the necessary costs. So, for example, if the data was already sorted according to some unknown permutation, Protocol DynamicSelectSorting will recognize it, determine the permutation, and no data items will be moved at all.

5.4 DISTRIBUTED SETS OPERATIONS

5.4.1 Operations on Distributed Sets

A key element in the functionality of distributed data is the ability to answer queries about the data as well as about the individual sets stored at the entities. Because the data is stored in many places, it is desirable to answer the query in such a way as to minimize the communication. We have already discussed answering simple queries such as order statistics. In systems dealing mainly with distributed data, such as distributed database systems, distributed file systems, distributed object systems, and so forth, the queries are much more complex, and are typically expressed in terms of primitive operations. In particular, in relational databases, a query will be an expression of join, project, and select operations. These operations are actually operations on sets and can be re-expressed in terms of the traditional operators intersection, union, and difference between sets. So to answer a query of the form “Find all the computer science students as well as those social science students enrolled also in anthropology but not in sociology”, we will need to compute an expression of the form

$$A \cup ((B \cap C) - (B \cap D)) \tag{5.30}$$

where A, B, C, and D are the sets of the students in computer science, social sciences, anthropology, and sociology, respectively. Clearly, if these sets are located at the entity x where the query originates, that entity can locally compute the results and generate the answer. However, if the entity x does not have all the necessary data, x will have to involve other entities causing communication. It is possible that each set is actually stored at a different entity, called the owner of that set, and none of them is at x. Even assuming that x knows which entities are the owners of the sets involved, there are many different ways and approaches that can be used to perform the computation. For example, all those sets could be sent by the owners to x, which will then perform the operation locally and answer the query. With this approach, call it A1, the volume of data items that will be moved is Vol(A1) = |A| + |B| + |C| + |D| . The actual number of messages will depend on the size of these sets as well as on the distances between x(A), x(B), x(C), x(D), and x, where x(·) denotes the owner


of the specified set. In some cases, for example in complete networks, the number of messages is given precisely by these sizes.

Another approach is to have x(B) send B to x(C); x(C) will then locally compute B ∩ C and send it to x(D), which will locally compute (B ∩ C) − (B ∩ D) = (B ∩ C) − D and send it to x(A), which will compute the final answer and send it to x. The amount of data moved with this approach, call it A2, is Vol(A2) = |B| + |B ∩ C| + |(B ∩ C) − D| + |A ∪ ((B ∩ C) − D)|. Depending on the sizes of the sets resulting from the partial computations, A1 could be better than A2.

Other approaches can be devised, each with its own cost. For example, as (B ∩ C) − D = B ∩ (C − D), we could have x(C) send C to x(D), which will use it to compute C − D and send the result to x(B); if we also have x(A) send A to x(B), x(B) can compute Expression 5.30 and send the result to x. The volume of transmitted items with this approach, call it A3, will be Vol(A3) = |C| + |C − D| + |A| + |A ∪ ((B ∩ C) − D)|.

IMPORTANT. In each approach, or strategy, the original expression is broken down into subexpressions, each to be evaluated at just a single site. For example, in approach A2, Expression 5.30 is decomposed into three subexpressions: E1 = (B ∩ C) to be computed by x(C), E2 = E1 − D to be computed by x(D), and E3 = A ∪ E2 to be computed by x(A). A strategy also specifies, for each entity involved in the computation, to what other sites it must send its own set or the results of local evaluations. For example, in approach A2, x(B) must send B to x(C); x(C) must send E1 to x(D); x(D) must send E2 to x(A); and x(A) must send E3 to the originator of the query x.

As already mentioned, the amount of items transferred by a strategy depends on the size of the results of the subexpressions (e.g., |B ∩ C|). Typically these sizes are not known a priori; hence, it is in general impossible to know beforehand which of these approaches is better from a communication point of view.
In practice, estimates of those sizes are used to decide the best strategy to use. Indeed, a large body of studies exists on how to estimate the size of an intersection or a union or a difference of two or more sets. In particular, an entire research area, called distributed query processing, is devoted to the study of the problem of computing the “best” strategy, and related problems.

We can, however, express a lower bound on the amount of data that must be moved. As the entity x where the query originates must provide the answer, then, assuming x has none of the sets involved in the query, it must receive the entire answer. That is,

Theorem 5.4.1 For every expression E, if the set of the entity x where the query originates is not involved in the expression, then for any strategy S, Vol(S) ≥ |E|.


What we will examine in the rest of this section is how we can answer queries efficiently by cleverly organizing the local sets. In fact, we will see how the sets can be locally structured so that the computations of those subexpressions (and, thus, the answer to those queries) can be performed minimizing the volume of data to be moved. To perform the structuring, some information is needed at each entity; if not available, it can be computed in a prestructuring phase.

5.4.2 Local Structure

We first of all see how we can structure at each entity $x_i$ the local data $D_i$ so as to answer operations of intersection and difference with the minimum amount of communication. The method we use to structure a local set is called Intersection Difference Partitioning (IDP). The idea of this method is to store each set $D_i$ as a collection $Z^i$ of disjoint subsets such that operations of union, intersection, and difference among the data sets can be computed easily, and with the least amount of data transfers.

Let us see precisely how we construct the partition $Z^i$ of the data set $D_i$. For simplicity, let us momentarily rename the other n − 1 sets $D_j$ (j ≠ i) as $S_1, S_2, \ldots, S_{n-1}$. Let us start with the entire set

$$Z^i_{0,1} = D_i. \tag{5.31}$$

We first of all partition it into two subsets: $Z^i_{1,1} = D_i \cap S_1$ and $Z^i_{1,2} = D_i - S_1$. Then, recursively, we partition $Z^i_{l,j}$ into two subsets:

$$Z^i_{l+1,2j-1} = Z^i_{l,j} \cap S_{l+1} \tag{5.32}$$

$$Z^i_{l+1,2j} = Z^i_{l,j} - S_{l+1}. \tag{5.33}$$

We continue this process until we obtain the sets $Z^i_{n-1,j}$; these sets form exactly the partition of $D_i$ we need. For simplicity, we will denote $Z^i_{n-1,j}$ simply as $Z^i_j$; hence the final partition of $D_i$ will be denoted by

$$Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle \tag{5.34}$$

where $m = 2^{n-1}$.

Example Consider the three sets D1 = {a, b, e, f, g, m, n, q}, D2 = {a, e, f, g, o, p, r, u, v} and D3 = {e, f, p, r, m, q, v} stored at entities x1, x2, x3, respectively. Let us focus on D1; it is first subdivided into $Z^1_{1,1} = D_1 \cap D_2 = \{a, e, f, g\}$ and $Z^1_{1,2} = D_1 - D_2 = \{b, m, n, q\}$. These are then subdivided creating the final partition $Z^1$ composed of $Z^1_{2,1} = \{e, f\}$, $Z^1_{2,2} = \{a, g\}$, $Z^1_{2,3} = \{m, q\}$, and $Z^1_{2,4} = \{b, n\}$.
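The recursive construction of Equations 5.32 and 5.33 is short enough to sketch directly. The function name `idp_partition` and the use of Python sets and lists are choices made for this illustration; a real entity would of course not hold the other sets locally, so this only shows what the resulting partition looks like.

```python
def idp_partition(d_i, others):
    """Leaves Z_1, ..., Z_m of the IDP tree of d_i against S_1, ..., S_{n-1}."""
    level = [set(d_i)]                 # level 0: the whole set (Eq. 5.31)
    for s in others:                   # one tree level per other set
        nxt = []
        for z in level:
            nxt.append(z & s)          # left child  (Eq. 5.32)
            nxt.append(z - s)          # right child (Eq. 5.33)
        level = nxt
    return level

D1 = set("abefgmnq")
D2 = set("aefgopruv")
D3 = set("efprmqv")

for leaf in idp_partition(D1, [D2, D3]):
    print(sorted(leaf))
```

For D1 this reproduces the four subsets of the example; the leaves are pairwise disjoint and their union is D1, so no item is stored twice.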

[Figure 5.15 shows one binary tree per set; the nodes are listed here level by level, left to right.]
T1: D1 = {a, b, e, f, g, m, n, q} | {a, e, f, g}, {b, m, n, q} | {e, f}, {a, g}, {m, q}, {b, n}
T2: D2 = {a, e, f, g, o, p, r, u, v} | {a, e, f, g}, {o, p, r, u, v} | {e, f}, {a, g}, {p, r, v}, {o, u}
T3: D3 = {e, f, m, p, q, r, v} | {e, f, m, q}, {p, r, v} | {e, f}, {m, q}, {p, r, v}, {}

FIGURE 5.15: Trees created by DSP.

This recursive partitioning of the set $D_i$ creates a binary tree $T_i$. The root (considered to be at level 0) corresponds to the entire set $D_i$. Each node in the tree corresponds to a subset (one of the $Z^i_{l,j}$'s) of this set; note that this subset is possibly empty. For a node at level l − 1 corresponding to subset S, its left child corresponds to the subset $S \cap S_l$ while the right child corresponds to the subset $S - S_l$. The trees for the three sets of the example above are shown in Figure 5.15. Notice that at each level l of the tree (including the last level l = n − 1), the entire set is represented:

Property 5.4.1 $D_i = \bigcup_{1 \le j \le 2^l} Z^i_{l,j}$

In other words, $Z^i_{l,1}, Z^i_{l,2}, \ldots, Z^i_{l,2^l}$ is a partition of $D_i$. Further observe that each level l ≥ 1 of the tree describes the relationship between elements of $D_i$ and those in the set $S_l$. In particular, the sets corresponding to the left children of level l are precisely the elements in common between $D_i$ and $S_l$:

Property 5.4.2 $\bigcup_{1 \le j \le 2^{l-1}} Z^i_{l,2j-1} = D_i \cap S_l$

By contrast, the sets corresponding to the right children of level l are precisely the elements in $D_i$ that are not part of $S_l$:

Property 5.4.3 $\bigcup_{1 \le j \le 2^{l-1}} Z^i_{l,2j} = D_i - S_l$

This means that, if we were to store at $x_i$ the entire tree $T_i$ (i.e., all the sets $Z^i_{l,j}$), then $x_i$ can immediately answer any query of the form $D_i - D_j$ and $D_i \cap D_j$ for


any j. In other words, if each $x_i$ has available its tree $T_i$, then any query of the form $D_i - D_j$ and $D_i \cap D_j$ can be answered by $x_i$ without any communication. We are going to see now that it is possible to achieve the same goal storing at $x_i$ only the last partition $Z^i$ (i.e., the leaves of the tree).

Observe that each level l of the tree contains not only the entire set $D_i$ but also information about the relationship between $D_i$ and all the sets $S_1, S_2, \ldots, S_l$. In particular, the last level l = n − 1 (i.e., the final partition) contains information about the relationship between $D_i$ and all the other sets. More precisely, the information contained in each node of the tree $T_i$ is also contained in the final partition and can be reconstructed from there:

Property 5.4.4 $Z^i_{l,j} = \bigcup_{1 \le k \le 2^{n-1-l}} Z^i_{k + (j-1) 2^{n-1-l}}$

Summarizing, each entity $x_i$ structures its local set $D_i$ as the collection $Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle$ of disjoint subsets created using the IDP method. This collection contains all the information contained in each node of the tree $T_i$.

IMPORTANT. Notice that when structuring $D_i$ as the partition $Z^i$, the number of data items stored at $x_i$ is still $|D_i|$, that is, no additional data items are stored anywhere.

5.4.3 Local Evaluation ()

Locally Computable Expressions If each $x_i$ stores its set $D_i$ as the partition $Z^i$, then each entity is immediately capable of computing the result of many expressions involving set operations. For example, we know that the partition $Z^i$ contains all the information contained in each node of the tree $T_i$ (Property 5.4.4); thus, by Properties 5.4.2 and 5.4.3 it follows that $x_i$ can answer without any communication any query of the form $D_i - D_j$ and $D_i \cap D_j$. In fact,

$$D_i \cap S_l = \bigcup_{1 \le j \le 2^{l-1},\; 1 \le k \le 2^{n-1-l}} Z^i_{k + (j-1) 2^{n-l}} \tag{5.35}$$

$$D_i - S_l = \bigcup_{1 \le j \le 2^{l-1},\; 1 \le k \le 2^{n-1-l}} Z^i_{k + (2j-1) 2^{n-l-1}}. \tag{5.36}$$
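Property 5.4.4 and Equations 5.35 and 5.36 can be checked on the final partition of D1 from the example (n = 3, so there are four leaves). The sketch below assumes the leaves are kept in a Python list with `leaves[k - 1]` holding $Z^i_k$; the function names are hypothetical.

```python
def node_from_leaves(leaves, l, j):
    """Property 5.4.4: Z_{l,j} is the union of leaves k + (j-1)*2^(n-1-l)."""
    width = len(leaves) >> l              # 2^(n-1-l) leaves below a level-l node
    start = (j - 1) * width
    return set().union(*leaves[start:start + width])

def intersection_with(leaves, l):
    """Eq. 5.35: D_i ∩ S_l, the union of the left children of level l."""
    return set().union(*(node_from_leaves(leaves, l, 2 * j - 1)
                         for j in range(1, 2 ** (l - 1) + 1)))

def difference_with(leaves, l):
    """Eq. 5.36: D_i − S_l, the union of the right children of level l."""
    return set().union(*(node_from_leaves(leaves, l, 2 * j)
                         for j in range(1, 2 ** (l - 1) + 1)))

# Final partition of D1 from the example; S1 = D2 and S2 = D3.
leaves = [set("ef"), set("ag"), set("mq"), set("bn")]
print(sorted(intersection_with(leaves, 1)))  # D1 ∩ D2
print(sorted(difference_with(leaves, 2)))    # D1 − D3
```

Both answers are assembled from subsets already stored at x1, so no data item has to be requested from x2 or x3.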

Actually, xi has locally available the answer to any expression composed of differences and intersections, involving any number of sets, provided that Di is the left operand in the differences involving Di . So for example, the query (D1 − D2 ) ∩ (D3 − (D4 ∩ D5 )) can be answered immediately both at x1 and x3 (see Exercise 5.6.51). Also some queries involving unions as well as intersections and differences can be answered immediately and locally. For example, both (D1 − (D2 ∩ D3 )) and ((D1 − D2 ) ∩ (D1 ∪ D3 )) can be answered by x1 .


Exactly what expressions can be answered by $x_i$? To answer this question, observe the following: if expression E can be answered locally by $x_i$, then $x_i$ can also answer E ∩ E′ and E − E′, where E′ is an arbitrary expression on the local sets; if two expressions E1 and E2 can be answered locally by $x_i$, so can the expression E1 ∪ E2.

Using these two facts and starting with $D_i$, we can characterize the set $E(x_i)$ of all the expressions that can be answered by $x_i$ directly without communication.

Local Evaluation Strategy Let us see now how $x_i$ can determine the answer to a query in $E(x_i)$ from the information stored in the final partition $Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle$, where $m = 2^{n-1}$.

First of all, let us introduce some terminology. We will call address of $Z^i_j$ the Boolean representation b(j) of j − 1 using n − 1 bits; for example, in Figure 5.15, the subset $Z^1_{2,3} = \{m, q\}$ has address 10, while 11 is the address of the subset $Z^1_{2,4}$. An expression on k operands is sequential if it is of the form ((. . . (((O1 o1 O2) o2 O3) o3 O4) . . .) ok−1 Ok) where the Oj are the operands and the oj are the set operators. An example of a sequential expression is (((A ∪ B) − C) ∪ B).

First consider the set $E^-(x_i) \subset E(x_i)$ of sequential expressions in $E(x_i)$ where
1. $D_i$ is the first operand,
2. each of the other sets $S_j$ appears at most once, and
3. the only operators are intersection and difference.

For example, the expression (((Di ∩ S3) − S1) ∩ S2) is in $E^-(x_i)$. To answer queries in $E^-(x_i)$ there is a simple strategy that $x_i$ can follow:

Strategy Bitmask
1. Create a bitmask of size n − 1.
2. For each set Sj
   (a) if Sj is the right operand of an intersection operator, then place a 0 in the jth position of the bitmask;
   (b) if Sj is the right operand of a difference operator, then place a 1 in the jth position of the bitmask;
   (c) if Sj is not involved in the query at all, place the wildcard symbol in the jth position of the bitmask.


3. Perform the union of all the subsets in the final partition whose address matches the pattern of the bitmask, where the wildcard symbol is matched both by 0 and 1.

Example The bitmask associated to expression

(((Di ∩ S3) − S1) ∩ S4)   (5.37)

when n = 6 will be 1 * 0 0 *, where * is the wildcard symbol and positions are counted from the left, as for the addresses in Figure 5.15: S1 is subtracted, S3 and S4 are intersected, and S2 and S5 do not appear. Entity $x_i$ will then calculate the union of the sets in its final partition $Z^i$ whose addresses match the bitmask, that is, the sets with address 10000, 10001, 11000, 11001. Thus, to answer query (5.37), $x_i$ will just calculate

$$Z^i_{17} \cup Z^i_{18} \cup Z^i_{25} \cup Z^i_{26}. \tag{5.38}$$

It is not difficult to verify that indeed by calculating (5.38) we obtain the answer to precisely query (5.37); in fact, the Evaluation Strategy Bitmask is correct (Exercise 5.6.53).

Summarizing, using strategy Bitmask entity $x_i$ can directly evaluate any expression in $E^-(x_i)$; those are, however, only a small subset of all the expressions in $E(x_i)$. Let us now examine how to extend to all queries in $E(x_i)$ the result we have just obtained. The key to the extension is the fact that any expression of $E(x_i)$ can be re-expressed as the union of subexpressions in $E^-(x_i)$ (Exercise 5.6.54).

Property 5.4.5 For every $Q \in E(x_i)$ there are $Q(1), \ldots, Q(k) \in E^-(x_i)$, k ≥ 1, such that $Q = \bigcup_{1 \le j \le k} Q(j)$.

For example, (Di − (S2 ∩ S4)) can be re-expressed as (Di − S2) ∪ (Di − S4). Similarly, ((S1 ∪ S2) ∩ Di) − (S4 ∪ S5) = ((Di ∩ S1) − S4 − S5) ∪ ((Di ∩ S2) − S4 − S5). Thus, to answer a query in $E(x_i)$, entity $x_i$ will first re-formulate it as a union of expressions in $E^-(x_i)$, evaluate each of them using strategy Bitmask, and then perform their union.

Strategy Local Evaluation
1. Re-formulate Q as a union of expressions Q(1), . . . , Q(k) in $E^-(x_i)$.
2. Evaluate each Q(j) using strategy Bitmask.
3. Perform the union of all the obtained results.

Notice that all this can be done by $x_i$ locally, without any communication.
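Strategies Bitmask and Local Evaluation can be sketched on the final partition of D1 from Figure 5.15 (n = 3, S1 = D2, S2 = D3). The sketch assumes that the jth position of the bitmask is the jth address bit counted from the left, which is the reading consistent with the addresses 10 and 11 given for that figure; the function names and the `*` wildcard are choices made for this example.

```python
def bitmask(n, intersected, subtracted):
    """Mask of n-1 symbols: '0' for '∩ S_j', '1' for '− S_j', '*' if S_j is absent."""
    mask = ["*"] * (n - 1)
    for j in intersected:
        mask[j - 1] = "0"
    for j in subtracted:
        mask[j - 1] = "1"
    return "".join(mask)

def evaluate(leaves, mask):
    """Strategy Bitmask: union of the leaves Z_j whose address b(j) matches mask."""
    result = set()
    for j, leaf in enumerate(leaves, start=1):
        address = format(j - 1, f"0{len(mask)}b")
        if all(m in ("*", a) for a, m in zip(address, mask)):
            result |= leaf
    return result

def local_evaluation(leaves, masks):
    """Strategy Local Evaluation: union of the Bitmask results of Q(1), ..., Q(k)."""
    return set().union(*(evaluate(leaves, m) for m in masks))

leaves = [set("ef"), set("ag"), set("mq"), set("bn")]  # final partition of D1

# (D1 − D2) ∩ D3 is in E^-(x1): S1 is subtracted, S2 is intersected.
m = bitmask(3, intersected=[2], subtracted=[1])
print(m, sorted(evaluate(leaves, m)))

# D1 − (D2 ∩ D3) = (D1 − D2) ∪ (D1 − D3): a union of two E^-(x1) expressions.
masks = [bitmask(3, [], [1]), bitmask(3, [], [2])]
print(sorted(local_evaluation(leaves, masks)))
```

The re-formulation of a general query as a union of sequential expressions (step 1 of Strategy Local Evaluation) is done by hand here; automating it is the content of Exercise 5.6.54.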


5.4.4 Global Evaluation

Let us now examine the problem of answering a query Q originating at an entity x once every local set $D_i$ has been stored as the partition $Z^i$. If the query can be answered directly (i.e., Q ∈ E(x)), x will do so. Otherwise, the query will be decomposed into subqueries that can be locally evaluated at one or more entities; the results of these partial evaluations are then collected at x so that the original query can be answered. Our goal is to ensure that the volume of data items to be moved is minimized. To achieve this goal, we use the following property.

Property 5.4.6 For every expression Q there are k ≤ n subexpressions Q(1), Q(2), . . . , Q(k) such that
1. $\forall Q(j) \; \exists y_j : Q(j) \in E(y_j)$,
2. $Q(i) \cap Q(j) = \emptyset$ for i ≠ j,
3. $Q = \bigcup_{1 \le j \le k} Q(j)$.

That is, any query Q can be re-expressed as the union of subqueries Q(1), . . . , Q(k), where each subquery can be answered directly by just one entity, once its local set has been stored using the partitioning method; furthermore, the answers to any two different subqueries are disjoint (Exercise 5.6.55). This gives rise to our strategy for evaluating an arbitrary query:

Strategy Global
1. x decomposes Q into Q(1), Q(2), . . . , Q(k) satisfying Property 5.4.6, and informs each $y_j$ of Q(j);
2. $y_j$ locally and directly evaluates Q(j) and sends the result to x; and
3. x computes the union of all the received items.

To understand the advantages of this strategy, let us examine again the implications of Property 5.4.6. As the results of any two subqueries are disjoint, while the union of all results of the subqueries is precisely what we are asking for, we have that:

Property 5.4.7 Let Q(1), Q(2), . . . , Q(k) satisfy Property 5.4.6 for Q. Then

$$|Q| = \sum_{1 \le j \le k} |Q(j)|.$$

This means that, for every query Q, in our Strategy Global the only data items that might be moved to x are those in the final answer, that is, Vol[Global] ≤ |Q|.
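As one concrete instance of Property 5.4.6, the query Q = D1 ∪ D2 ∪ . . . ∪ Dn can be decomposed into Q(j) = Dj − D1 − . . . − Dj−1, which is sequential with Dj as its first operand and is therefore locally computable at xj. The sketch below simulates Strategy Global for this particular decomposition on the sets of the earlier example; it assumes the originator holds none of the sets, and the function name `global_union` is hypothetical.

```python
def global_union(sets):
    """Simulate Strategy Global for Q = D_1 ∪ ... ∪ D_n.

    Subquery Q(j) = D_j − D_1 − ... − D_{j-1} is evaluated "at x_j"; the
    originator only unions what it receives. Returns (answer, volume moved).
    """
    received, volume = set(), 0
    for j, d_j in enumerate(sets):
        q_j = set(d_j).difference(*sets[:j])   # locally computable at x_j
        assert received.isdisjoint(q_j)        # condition 2 of Property 5.4.6
        volume += len(q_j)
        received |= q_j
    return received, volume

D1 = set("abefgmnq")
D2 = set("aefgopruv")
D3 = set("efprmqv")

answer, volume = global_union([D1, D2, D3])
print(volume, len(answer))                 # items moved vs. size of the answer
print(len(D1) + len(D2) + len(D3))         # volume if every set were shipped whole
```

Here the volume moved equals the size of the answer, whereas shipping the three sets in full would move almost twice as many items.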


In other words, strategy Global is optimal: its volume matches the lower bound of Theorem 5.4.1. This optimality is with regard to the amount of data items that will be moved. There are different possible decompositions of a query Q into subqueries satisfying Property 5.4.6. All of them are equally acceptable to our strategy, and they all provide optimal volume costs.

IMPORTANT. To calculate the cost in terms of messages we need to take into account also the distances between the nodes in the network. In this regard, some decompositions may be better than others. The problem of determining the decomposition that requires the fewest messages is a difficult one, and no solution is known to date.

5.4.5 Operational Costs

An important consideration is that of the cost of setting up the final partitions at each entity. Once the data is in this format, we have seen how complex queries can be handled with minimal communication. But putting it in this format requires communication; in fact, each entity must somehow receive information from all the other entities about their sets. In a complete network this can require just a single transmission of each set to a predetermined coordinator that will then compute and send the appropriate partition to each entity; hence, the total cost will be O(N) where N is the total amount of data. By contrast, in a line network the total cost can be as bad as O(N²), for example, if all sets have almost the same size.

It is true that this cost is incurred only once, at set-up time. If the goal is only to answer a few queries, the cost of setup may exceed that of simply performing the queries without using the partitioned sets. But for persistent distributed data, upon which many queries may be placed, this is an efficient solution.

Another consideration is that of the addition or removal of data from the distributed sets.
As each entity contains some knowledge about the contents of all other entities, any time an item is added to or removed from one of the sets, every entity must update its partition to reflect this fact. Fortunately, the cost of doing this does not exceed the cost of broadcasting the added (or removed) item to each entity. Clearly this format is more effective for slowly changing distributed data sets.

5.5 BIBLIOGRAPHICAL NOTES

The problems of distributed selection and distributed sorting were studied for a small set by Greg Frederickson in special networks (Exercises 5.6.1–5.6.3) [4], and by Shmuel Zaks [23]. Also for a small set, the cost using bounded messages and, thus, the bit complexity has been studied by Mike Loui [8] in ring networks; by Ornan Gerstel, Yishay Mansour, and Shmuel Zaks in a star [5]; and in trees by Ornan Gerstel and Shmuel Zaks [6], and by Alberto Negro, Nicola Santoro, and Jorge Urrutia [12]. Selection among two sites was first studied by Michael Rodeh [14]; his solution was later improved by S. Mantzaris [10], and by Francis Chin and Hing Ting [3]. Reducing the expected costs of distributed selection has been the goal of several investigations. Protocol RandomSelect was designed by Liuba Shrira, Nissim Francez,


and Michael Rodeh [21]. Nicola Santoro, Jeffrey Sidney, and Stuart Sidney designed Protocol RandomFlipSelect [19]. Protocol RandomRandomSelect is due to Nicola Santoro, Michael Scheutzow, and Jeffrey Sidney [17]. General selection protocols, with emphasis on the worst case, were developed by Doron Rotem, Nicola Santoro, and Jeffrey Sidney [16], and by Nicola Santoro and Jeffrey Sidney [18]. The more efficient protocol Filter was developed by John Marberg and Eli Gafni [11]. The even more efficient protocol ReduceSelect was later designed by Nicola Santoro and Ed Suen [19].

The Odd-Even MergeSort algorithm, on which Protocols OddEven-LineSort and OddEven-MergeSort are based, was developed by Kenneth Batcher [1]. The first general distributed sorting algorithm is due to Lutz Wegner [22]. More recent but equally costly sorting protocols have been designed by To-Yat Cheung [2], and by Peter Hofstee, Alain Martin, and Jan van de Snepscheut [7]; experimental evaluations were performed by Wo-Shun Luk and Franky Ling [9]. The optimal SelectSort was designed by Doron Rotem, Nicola Santoro, and Jeffrey B. Sidney [15], who also designed protocol DynamicSelectSort. Other protocols include those designed by Hanmao Shi and Jonathan Schaeffer [20].

There is an extensive body of investigations on database queries, whose computation requires the use of distributed set operations like union, intersection, and difference. The entire field of distributed query processing is dedicated to this topic, mostly focusing on the estimation of the size of the output of a set operation and thus of the entire query. The IDP structure for minimum-volume operations on distributed sets was designed and analyzed in this context by Ekow Otoo, Nicola Santoro, and Doron Rotem [13].

5.6 EXERCISES, PROBLEMS, AND ANSWERS

5.6.1 Exercises

Exercise 5.6.1 () Consider a ring network where each entity has just one item. Show how to perform selection using $O(n \log^3 n)$ messages.

Exercise 5.6.2 () Consider a mesh network where each entity has just one item. Show how to perform selection using $O(n \log^{3/2} n)$ messages.

Exercise 5.6.3 () Consider a network whose topology is a complete binary tree where each entity has just one item. Show how to perform selection using $O(n \log n)$ messages.

Exercise 5.6.4 Prove that after discarding the elements greater than $m_x$ from $D_x$ and discarding the elements greater than $m_y$ from $D_y$, the overall lower median is the lower median of the elements still under consideration.

Exercise 5.6.5 Write protocol Halving so that it works with any two arbitrarily sized sets with the same complexity.

Exercise 5.6.6 Prove that the K-selection problem can be reduced to a median-finding problem regardless of K and of the size of the two sets.

Exercise 5.6.7 Modify protocol Halving as follows: In iteration i, (a) discard from both $D_x^i$ and $D_y^i$ all elements greater than $\max\{m_x^i, m_y^i\}$ and all those smaller than $\min\{m_x^i, m_y^i\}$, where $D_x^i$ and $D_y^i$ denote the sets of elements of $D_x$ and $D_y$ still under consideration at the beginning of stage i, and $m_x^i$ and $m_y^i$ denote their lower medians; (b) transform the problem again into a median-finding one. Write the corresponding algorithm, GeneralHalving, prove its correctness, and analyze its complexity.

Exercise 5.6.8 Implement protocol GeneralHalving of Exercise 5.6.7, thoroughly test it, and run extensive experiments. Compare the experimental results with the theoretical ones.

Exercise 5.6.9 () Extend the technique of protocol Halving to work with three sets, $D_x$, $D_y$, and $D_z$. Write the corresponding protocol, prove its correctness, and analyze its complexity.

Exercise 5.6.10 Random Item Selection () Modify the protocol of Exercise 2.9.52 so that it can be used to select uniformly at random an element still under consideration in each iteration of Strategy RankSelect. Your protocol should use at most $2(n - 1) + d_T(s, x)$ messages and $2r(s) + d_T(s, x)$ ideal time units in each iteration. Prove both correctness and complexity.

Exercise 5.6.11 () Prove that the expected number of iterations performed by Protocol RandomSelect until termination is at most 1.387 log N + O(1).

Exercise 5.6.12 () Determine the number of iterations if we terminate protocol RandomSelect as soon as the search space contains at most cn items, where c is a fixed constant. Determine the total cost of this truncated execution followed by an execution of protocol Rank.

Exercise 5.6.13 Prove that in the worst case, the number of iterations performed by Protocol RandomFlipSelect until termination is N.

Exercise 5.6.14 () Prove that the expected number of iterations performed by Protocol RandomFlip until termination is less than ln(Δ) + ln(n) + O(1).

Exercise 5.6.15 () Determine the number of iterations if we terminate protocol RandomFlipSelect as soon as the search space contains at most cn items, where c is a fixed constant. Determine the total cost of this truncated execution followed by an execution of protocol Rank.

Exercise 5.6.16 Write Protocol RandomRandomSelect ensuring that each iteration uses at most 4(n − 1) + r(s) messages and 5r(s) ideal time units. Implement the protocol and thoroughly test your implementation.

Exercise 5.6.17 () Prove that the expected number of iterations performed by Protocol RandomRandomSelect until there are fewer than n items left under consideration is at most $\frac{4}{3} \log \log \Delta + 1$.

Exercise 5.6.18 Prove that the number of iterations performed by Protocol Filter until there are no more than n elements left under consideration is at most 2.41 log(N/n).

Exercise 5.6.19 Prove that in the execution of Protocol REDUCE, Local Contraction is executed at most three times.

Exercise 5.6.20 Prove that after the execution of Cutting Tool on $C(l = 2^i)$, only the l − 1 columns C(1), C(2), ..., C(l − 1) might remain unchanged; all others, including C(l), will have at least n − K/l of the entries equal to +∞.

Exercise 5.6.21 Prove that after the execution of Protocol CUT there will be at most min{n, Δ} log Δ items left under consideration.

Exercise 5.6.22 Consider the system shown in Figure 5.9. How many items will $x_5$ have (a) after a compacted sorting with w = 5? (b) after an equidistributed sorting? Justify your answer.

Exercise 5.6.23 Prove that OddEven-LineSort performs an invariant-sized sort of an equidistribution on an ordered line.

Exercise 5.6.24 () Prove that OddEven-LineSort performs an invariant-sized sort of any distribution on an ordered line.

Exercise 5.6.25 () Prove that OddEven-LineSort performs a compacted sort of any distribution on an ordered line.

Exercise 5.6.26 () Prove that OddEven-LineSort performs an equidistributed sort of any distribution on an ordered line.

Exercise 5.6.27 Prove that OddEven-LineSort sorts an equidistributed distribution in n − 1 iterations regardless of whether the required sorting is invariant-sized, equidistributed, or compacted with all entities having the same capacity.

Exercise 5.6.28 Prove that there are some initial conditions under which protocol OddEven-LineSort uses N − 1 iterations to perform invariant-sized sorting of N items distributed on a sorted line, regardless of the number n of entities.

Exercise 5.6.29 Consider an initial equidistribution sorted according to permutation π = π(n), π(n − 1), ..., π(1). Prove that, executing protocol OddEven-LineSort in this case, every data item will change location in each iteration.

Exercise 5.6.30 Prove that when n > 3, if the line is not sorted according to π, then protocol OddEven-LineSort terminates but does not sort the data according to π.

Exercise 5.6.31 Write the set of rules of protocol OddEven-MergeSort. Implement the protocol and thoroughly test it.

Exercise 5.6.32 Prove that protocol OddEven-MergeSort is a sequence of 1 + log n iterations and that in each iteration (except the last) every data item is sent once or twice to another entity.

Exercise 5.6.33 Prove that protocol OddEven-MergeSort correctly sorts, regardless of the storage requirement, if the initial set is equidistributed.

Exercise 5.6.34 Consider an initial distribution where $x_1$ and $x_n$ have the same number K = (N − n + 2)/2 of data items, while all other entities have just a single data item. Augment protocol OddEven-MergeSort so as to perform an invariant sort when π = 1, 2, ..., n. Show the corresponding sorting diagram. How many additional simple merge operations are needed? How many operations does your solution perform? Determine the time and message costs of your solution.

Exercise 5.6.35 For each of the three storage requirements (invariant, equidistributed, compacted) show a situation where Ω(N) messages need to be sent to sort in a complete network, even when the data are initially equidistributed.

Exercise 5.6.36 Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in a ring. What would be the bound for initially equidistributed sets?

Exercise 5.6.37 () Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in a labeled hypercube. What would be the bound for initially equidistributed sets?

Exercise 5.6.38 () Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in an oriented torus. What would be the bound for initially equidistributed sets?

Exercise 5.6.39 Show how $x_{\pi(i)}$ can find out $k_i$ at the beginning of the ith iteration of strategy SelectSort. Initially, each entity knows only its index in the permutation (i.e., $x_{\pi(i)}$ knows i) as well as the storage requirements.

Exercise 5.6.40 Write the set of rules of Protocol SelectSort. Implement and test the protocol. Compare the experimental costs with the theoretical bounds.

Exercise 5.6.41 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in an ordered line. Determine under what conditions the protocol is optimal for this network. Compare this cost with that of protocol OddEven-LineSort.

Exercise 5.6.42 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in an ordered line. Determine under what conditions the protocol is optimal for this network. Compare this cost with that of protocol OddEven-LineSort.

Exercise 5.6.43 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in a ring. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.36).

Exercise 5.6.44 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in a ring. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.36).

Exercise 5.6.45 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in a labeled hypercube of dimension d. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.37).

Exercise 5.6.46 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in a labeled hypercube of dimension d. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.37).

Exercise 5.6.47 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in an oriented torus of dimension p × q. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.38).

Exercise 5.6.48 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in an oriented torus of dimension p × q. Determine under what conditions the protocol is optimal for this network (Hint: Use the result of Exercise 5.6.38).

Exercise 5.6.49 Show how in strategy DynamicSelectSort the coordinator x can determine π from the received information in $O(n^3)$ local processing activities.

Exercise 5.6.50 Write the set of rules of Protocol DynamicSelectSorting. Implement and test the protocol. Compare the experimental costs with the theoretical bounds.

Exercise 5.6.51 Prove that the query $(D_1 - D_2) \cap (D_3 - (D_4 \cap D_5))$ can be answered immediately at both $x_1$ and $x_3$ if each of the sets is stored by its entity using the DSP method.

Exercise 5.6.52 Show that expressions 5.38 and 5.38 are equal.

Exercise 5.6.53 Prove that using strategy Bitmask, entity $x_i$ can directly evaluate any expression in $E^-(x_i)$.

Exercise 5.6.54 () Prove Property 5.4.5: Any expression of E(x) can be reexpressed as the union of sub-expressions in $E^-(x_i)$.

Exercise 5.6.55 () Prove Property 5.4.6.

5.6.2 Problems

Problem 5.6.1 () Design a generic protocol to perform selection in a small set using $o(n^2)$ messages in the worst case.

5.6.3 Answers to Exercises

Partial Answer to Exercise 5.6.4. Among the $2^{p-1}$ elements removed from consideration, exactly $2^{p-2}$ are greater than the median while exactly $2^{p-2}$ are smaller than the median.

Answer to Exercise 5.6.13. Without loss of generality, let K ≤ N − K + 1. Then, for the first N − 2K + 2 iterations, the adversary will choose d(i) to be the largest item in the search space. In this way, only d(i) will be removed from the search space in that iteration;
furthermore, we still have K(i + 1) ≤ N(i + 1) − K(i + 1) + 1, where K(i) and N(i) are the rank of d* and the size of the search space at the beginning of iteration i. As in these iterations we are removing only elements larger than d*, after the N − 2K + 1 iterations d* is the median of the search space. At this point, the adversary will alternate selecting d(i) to be the smallest item in the search space in one iteration and the largest item in the next one. In this way, only d(i) will be removed and d* continues to be the (lower) median of the search space. Hence, the additional number of iterations is exactly 2K − 2, for a total of N iterations.

Partial Answer to Exercise 5.6.18. Show that at least 1/4 of the items are removed from consideration at each iteration.

Partial Answer to Exercise 5.6.19. Let K(j) and N(j) be the rank of f* in the search space and the size of the search space at the end of iteration j of the while loop in Protocol REDUCE. Call an iteration a flip if Δ(j) = N(j − 1) − Δ(j − 1) + 1 < Δ(j − 1). First of all observe that if the (j + 1)th iteration is not a flip, then it is the last iteration. Let the (j + 1)th iteration be a flip, and let q(j + 1) be the number of entities whose local search space is reduced in this iteration; q(j + 1) must be at least 1, otherwise the iteration would not be a flip. We will show that q(j + 1) = 1. By contradiction, if q(j + 1) > 1, there must be at least two entities x and y that will have their search space reduced in iteration (j + 1). That is, N(x, j) > Δ(j) and N(y, j) > Δ(j), where N(x, j) and N(y, j) denote the number of items still under consideration at x and y, respectively, at the end of the jth iteration. Then N(j) ≥ N(x, j) + N(y, j) ≥ 2Δ(j). This means that N(j) − Δ(j) + 1 > Δ(j), which implies that Δ(j + 1) = min{Δ(j), N(j) − Δ(j) + 1} = Δ(j), contradicting the fact that iteration (j + 1) is a flip. Hence, q(j + 1) = 1; that is, if iteration (j + 1) is a flip, only one entity will reduce its search space in that iteration. To complete the proof, we must prove that the jth and the (j + 1)th iterations cannot both be flips.

Answer to Exercise 5.6.22. (a) none; (b) one.

Answer to Exercise 5.6.28. Consider the initial condition where the initial distribution is sorted according to n, n − 1, ..., 1. Let $x_1$ and $x_n$ each contain (N − n + 2)/2 items, while all other entities have only one item each. Then trivially, in each odd iteration only one item can leave $x_1$. Hence, the last item to move from $x_1$ to $x_n$ will do so in the ((N − n + 2)/2)th odd iteration, which is the (N − n + 1)th iteration overall; this item reaches $x_n$ after an additional n − 2 iterations. Hence, the claimed N − 1 total number of iterations before termination.

Answer to Exercise 5.6.30. Without loss of generality let π = 1, 2, ..., n. If the line is not sorted according to π, then there is an entity $x_i$ whose neighbors in the line, y and z, have indices
“greater” (respectively “smaller”) than it, that is, $y = x_j$ and $z = x_k$ where both j and k are greater (respectively, smaller) than i. Without loss of generality let j > k (respectively, j < k); that is, once sorted, the data stored in y must be greater (respectively, smaller) than the data stored in z. Among the data initially stored at z, include the largest data item D[N] (respectively, the smallest item D[1]). For the data to be sorted, this item must move from $z = x_k$ to $y = x_j$, passing through $x_i$. However, as k > i (respectively, k < i), according to the protocol z will never send D[N] (respectively, D[1]) to $x_i$.

Answer to Exercise 5.6.39. If the storage requirement is invariant sized, then $k_i = |D_{\pi(i)}|$, which is known to $x_{\pi(i)}$. If the requirement is equidistributed, then the entities need to know N/n; both n and N, if not already known, can be easily acquired (e.g., using saturation on a spanning tree). Then, $k_i = N/n$ for 1 ≤ i ≤ n − 1. If the storage requirement is compacted with parameter w, then $k_i = w$ for 1 ≤ i ≤ N/w, while $k_i = 0$ for i > N/w. Again, knowing N allows each entity to know the size of its final set of data items.

Answer to Exercise 5.6.49. Observe that if π(j) = k, then to transfer to $x_k$ all the data items that must end up there requires the transmission of

$$\beta_{j \to k} = \sum_{i=1}^{n} |D_{i,j}| \, d_G(x_i, x_k)$$

messages. Define variables $z_{j,k}$ to be equal to 1 if π(j) = k, and 0 otherwise. Then minimization of expression 5.28 reduces to finding a 0-1 solution for the linear programming assignment problem:

Minimize

$$g[Z] = \sum_{j=1}^{n} \sum_{k=1}^{n} \beta_{j \to k} \, z_{j,k}$$

subject to

$$\sum_{k=1}^{n} z_{j,k} = 1 \quad (1 \le j \le n), \qquad \sum_{j=1}^{n} z_{j,k} = 1 \quad (1 \le k \le n), \qquad z_{j,k} \ge 0 \quad (1 \le j, k \le n).$$

A single entity can solve this problem in $O(n^3)$ local processing activities once the $\beta_{j \to k}$'s are available at that entity.
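The reduction above can be illustrated with a small Python sketch. It builds the costs $\beta_{j \to k}$ from two hypothetical inputs, `sizes[i][j]` (standing for $|D_{i,j}|$) and `dist[i][k]` (standing for $d_G(x_i, x_k)$), and then picks a minimizing permutation by exhaustive search. The function names, the 0-based indices, and the sample data are assumptions made for illustration. The exhaustive search is only a stand-in chosen for brevity: it takes time proportional to n!, so it does not achieve the $O(n^3)$ bound, for which a standard assignment algorithm (such as the Hungarian method) would be needed.

```python
from itertools import permutations


def transfer_costs(sizes, dist):
    """Return beta, where beta[j][k] = sum_i sizes[i][j] * dist[i][k].

    sizes[i][j] plays the role of |D_{i,j}| and dist[i][k] the role of
    d_G(x_i, x_k); both are illustrative inputs, indexed from 0.
    """
    n = len(sizes)
    return [
        [sum(sizes[i][j] * dist[i][k] for i in range(n)) for k in range(n)]
        for j in range(n)
    ]


def best_assignment(beta):
    """Exhaustively find pi minimizing sum_j beta[j][pi[j]] (small n only)."""
    n = len(beta)

    def cost(pi):
        return sum(beta[j][pi[j]] for j in range(n))

    best = min(permutations(range(n)), key=cost)
    return list(best), cost(best)


# Hypothetical instance: a line of three entities x0 - x1 - x2.
dist = [[0, 1, 2],
        [1, 0, 1],
        [2, 1, 0]]
# x0 holds 3 items of final set 2, x1 holds 3 items of final set 1,
# and x2 holds 3 items of final set 0.
sizes = [[0, 0, 3],
         [0, 3, 0],
         [3, 0, 0]]

beta = transfer_costs(sizes, dist)
pi, total = best_assignment(beta)
print(pi, total)
```

In this instance every final set is already stored entirely at one entity, so the cheapest permutation leaves each set where it is and no data message is needed.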

BIBLIOGRAPHY

[1] K.E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computer Conference, pages 307–314, 1968.
[2] To-Yat Cheung. An algorithm with decentralized control for sorting files in a network. Journal of Parallel and Distributed Computing, 7(3):464–481, 1989.
[3] F. Chin and H.F. Ting. An improved algorithm for finding the median distributively. Algorithmica, 2:235–249, 1987.
[4] G.N. Frederickson. Distributed algorithms for selection in sets. Journal of Computer and System Sciences, 37(3):337–348, 1988.
[5] O. Gerstel, Y. Mansour, and S. Zaks. Bit complexity of order statistics on a distributed star network. Information Processing Letters, 30(3):127–132, 1989.
[6] O. Gerstel and S. Zaks. The bit complexity of distributed sorting. Algorithmica, 18:405–416, 1997.
[7] H.P. Hofstee, A.J. Martin, and J.L.A. van de Snepscheut. Distributed sorting. Science of Computer Programming, 15(2–3):119–133, 1990.
[8] M.C. Loui. The complexity of sorting on distributed systems. Information and Control, 60:70–85, 1984.
[9] W.S. Luk and Franky Ling. An analytical/empirical study of distributed sorting on a local area network. IEEE Transactions on Software Engineering, 15(5):575–586, 1989.
[10] S.L. Mantzaris. An improved algorithm for finding the median distributively. Algorithmica, 10(6):501–504, 1993.
[11] J.M. Marberg and E. Gafni. Distributed sorting algorithms for multi-channel broadcast networks. Theoretical Computer Science, 52(3):193–203, 1987.
[12] A. Negro, N. Santoro, and J. Urrutia. Efficient distributed selection with bounded messages. IEEE Transactions on Parallel and Distributed Systems, 8:397–401, 1997.
[13] E.J. Otoo, N. Santoro, and D. Rotem. Improving semi-join evaluation in distributed query processing. In 7th International Conference on Distributed Computing Systems, pages 554–561, Sept. 1987.
[14] M. Rodeh. Finding the median distributively. Journal of Computer and System Sciences, 24(2):162–167, 1982.
[15] D. Rotem, N. Santoro, and J.B. Sidney. Distributed sorting. IEEE Transactions on Computers, 34:372–376, 1985.
[16] D. Rotem, N. Santoro, and J.B. Sidney. Shout-echo selection in distributed files. Networks, 16:77–86, 1986.
[17] N. Santoro, M. Scheutzow, and J.B. Sidney. On the expected complexity of distributed selection. Journal of Parallel and Distributed Computing, 5:194–203, 1988.
[18] N. Santoro and J.B. Sidney. Order statistics on distributed sets. In 20th Allerton Conf. on Communication, Control and Computing, pages 251–256, 1982.
[19] N. Santoro and E. Suen. Reduction techniques for selection in a distributed file. IEEE Transactions on Computers, 38(6):891–896, 1989.
[20] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 14(4):361–372, 1992.
[21] L. Shrira, N. Francez, and M. Rodeh. Distributed k-selection: From a sequential to a distributed algorithm. In 2nd ACM Symposium on Principles of Distributed Computing, pages 143–153, 1983.
[22] L.M. Wegner. Sorting a distributed file in a network. Computer Networks, 8(5/6):451–462, December 1984.
[23] S. Zaks. Optimal distributed algorithms for sorting and ranking. IEEE Transactions on Computers, 34:376–380, 1985.

CHAPTER 6

Synchronous Computations

6.1 SYNCHRONOUS DISTRIBUTED COMPUTING

6.1.1 Fully Synchronous Systems

In the distributed computing environments we have considered so far, we have not made any assumption about time. In fact, from the model, we know only that in absence of failure, a message transmitted by an entity will eventually arrive at its neighbor: the Finite Delays axiom. Nothing else is specified, so we do not know, for example, how much time a communication will take. In our environment, each entity is endowed with a local clock; still, no assumption is made on the functioning of these clocks, their rate, and how they relate to each other or to communication delays.

For these reasons, the distributed computing environments described by the basic model are commonly referred to as fully asynchronous systems. They represent one extreme in the spectrum of message-passing systems with respect to time. As soon as we add temporal restrictions, making assumptions on the local clocks and/or communication delays, we describe different systems within this spectrum.

At the other extreme are fully synchronous systems, distributed computing environments where there are strong assumptions both on the local clocks and on communication delays. These systems are defined by the following two restrictions about time: Synchronized Clocks and Bounded Transmission Delays.

Restriction 6.1.1 Synchronized Clocks
All local clocks are incremented by one unit simultaneously.

In other words, all local clocks ‘tick’ simultaneously. Notice that this assumption does not mean that the clocks have the same value, but just that their value is incremented at the same time. Further notice that the interval of time between consecutive increments in general need not be constant. For simplicity, in the following we will assume that it is constant and denote this constant by δ; see Figure 6.1.

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.

FIGURE 6.1: In a fully synchronous system, all clocks tick periodically and simultaneously, and there is a known upper bound Δ on communication delays.

By convention,

1. entities will transmit messages (if needed) to their neighbors only at the strike of a clock tick;
2. at each clock tick, an entity will send at most one message to the same neighbor.

Restriction 6.1.2 Bounded Communication Delays
There exists a known upper bound on the communication delays experienced by a message in absence of failures.

In other words, there is a constant Δ such that in absence of failures, every message sent at time T will arrive and be processed by time T + Δ. In terms of clock ticks, this means that in absence of failures, every message sent at local clock tick t will arrive and be processed by clock tick $t + \lceil \Delta/\delta \rceil$ (sender's time); see Figure 6.1.

Summarizing, a fully synchronous system is a distributed computing environment where both the above restrictions hold. Notice that knowledge of Δ can be replaced by knowledge of $\lceil \Delta/\delta \rceil$.

6.1.2 Clocks and Unit of Time

In a fully synchronous system, two consecutive clock ticks constitute a unit of time, and we measure the time costs of a computation in terms of the number of clock ticks elapsed from the time the first entity starts the computation to the time the last entity terminates its participation in the computation.

Notice that, in this “clock time,” there is an underlying notion of “real time” (or physical time), one that exists outside the system (and independent of it), in terms of which we express the distance δ between clock ticks as well as the bound Δ on communication delays.

We can redefine the unit of time to be composed of u > 1 consecutive clock ticks. In other words, we can define new clock ticks, each comprising u old ones, and act accordingly. In particular, each entity will only send messages at the beginning of a new time unit and does not send more than one message to the same neighbor in each new time unit. Clearly, the entities must agree on when the new time unit starts. After the transformation, we can still measure time costs of a computation correctly: If the execution of a protocol lasts K new time units, its time cost is uK original clock ticks.

FIGURE 6.2: Redefine the clock ticks so that the delays are unitary.

Observe that if we choose $u = \lceil \Delta/\delta \rceil$ (Figure 6.2), then with the new clocks communication delays become unitary: If an entity x sends a message at the (new) local clock tick t to a neighbor, in absence of failures, the message is received and processed there at the (new) clock tick t + 1 (sender's time). In other words, any fully synchronous system can be transformed so as to have unitary delays.
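The arithmetic of this redefinition can be sketched in a few lines of Python. The sketch assumes that Δ and δ are given as positive integers in the same real-time unit, that the entities have agreed that new time units start at the original ticks that are multiples of u, and that the function names are merely illustrative.

```python
def ticks_per_unit(delay_bound, tick_length):
    """Return u = ceil(delay_bound / tick_length) using integer arithmetic."""
    return -(-delay_bound // tick_length)


def next_unit_start(old_tick, u):
    """First original tick >= old_tick at which a new time unit begins.

    Assumes new units start at the original ticks that are multiples of u.
    """
    return -(-old_tick // u) * u


def cost_in_old_ticks(new_units, u):
    """Time cost, in original ticks, of an execution lasting new_units units."""
    return u * new_units


# Hypothetical values: delay bound 7 and tick length 2 (same real-time unit).
u = ticks_per_unit(7, 2)
send_tick = next_unit_start(9, u)       # an entity ready at old tick 9 waits
arrival_deadline = send_tick + u        # the message is processed by this tick
print(u, send_tick, arrival_deadline, cost_in_old_ticks(5, u))
```

With these values a message handed to the protocol at original tick 9 is held until the next unit boundary and is guaranteed to be processed by the boundary after that, which is exactly the unitary-delay behavior described above.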

This means that we can assume, without loss of generality, that the following restriction holds:

Restriction 6.1.3 Unitary Communication Delays
In absence of failures, a transmitted message will arrive and be processed after at most one clock tick.

The main advantage of redefining the unit of time in this way is that it greatly simplifies the design and analysis of protocols for fully synchronous systems. In fact, it is common to find fully synchronous systems defined directly as having unitary delays.

IMPORTANT. In the following, the pair of Restrictions 6.1.1 and 6.1.3, defining a fully synchronous system with unitary delay, will be denoted simply by Synch.

6.1.3 Communication Delays and Size of Messages

A fully synchronous system, by definition, guarantees that, in absence of failures, any allowed message will encounter bounded delays. More precisely, by definition, for any message M, the communication delay τ(M) encountered by M in absence of failures will always be

τ(M) ≤ Δ.    (6.1)

Notice that this must hold regardless of the size (i.e., the number of bits) of M. Let us examine this fact carefully. By Restriction 6.1.2, Δ is bounded. For Δ to be bounded, τ(M) must be bounded. This fact implies that the size of M must be bounded: To assume otherwise means that the system allows communication of unbounded messages in bounded time, an impossibility. This means that the following property holds:

Property 6.1.1 Bounded messages
In fully synchronous systems, messages have bounded length.

In other words, there exists a constant c (depending on the system) such that each message will contain at most c bits. Bounded messages are also called packets, and the constant c is called packet size.

IMPORTANT. The packet size c is a system parameter. It could be related to other system parameters such as n (the network size) or m (the number of links). However, it cannot depend on input values (unless they are also bounded).

The bounded messages property has important practical consequences. It implies that if the information an entity x must transmit does not fit in a packet, that information must be “split up” and transmitted using several packets. More precisely, the transmission of w > c bits to a neighbor actually requires the transmission of M[w] messages, where $M[w] \ge \lceil w/c \rceil$. This fact affects not only the message costs but also the time costs. As at most one message can be sent to a neighbor at a given clock tick, the number of clock ticks required by the transmission of w > c bits is $CT[w] \ge \lceil w/c \rceil$.

6.1.4 On the Unique Nature of Synchronous Computations

Fully synchronous computing environments are dramatically different from the asynchronous ones we have considered so far. The difference is radical (that is, it goes to the roots) and provides
the protocol designer working in a fully synchronous environment with computational means and tools that are both unique and very powerful. In the following we will briefly describe two situations providing an insight into the unique nature of synchronous computations.

Overcoming Lower Bounds: Different Speeds

As a first example of a synchronous algorithm, we will discuss a protocol for leader election in synchronous rings. We assume the standard restrictions for elections (IR), as well as Synch; the goal is to elect as leader the candidate with the smallest value.

The protocol is essentially AsFar with an interesting new idea. Recall that in AsFar each entity originates a message with its own id, forwards only messages with the smallest id seen so far, and trashes all the other incoming messages. The message with the smallest value will never be trashed; hence it will make a full tour of the ring and return to its originator; every other message will be trashed by the first entity with a smaller id it encounters. We have seen that this protocol has an optimal message complexity on the average but uses $O(n^2)$ messages in the worst case.

The interesting new idea is to have each message travel along the ring at a different speed, determined by the id it contains, so that messages with smaller ids travel faster than those with larger values. In this way, a message with a small id can “catch up” with a slower message (containing a larger id); when this happens, the message with the larger id will be trashed. In other words, a message with a large id is trashed not only if it reaches an entity aware of a smaller id but also if it is reached by a message with a smaller id.

However, in a synchronous system, every message transmission will take at most one time unit; so, in a sense, all messages travel at the same speed. How can we implement variable speeds in a synchronous system? The answer is simple:

(a) When an entity x receives a message with a value i smaller than any seen so far by x, instead of immediately forwarding the message along the ring (as the protocol AsFar would require), x will hold this message for an amount of time (i.e., a number of clock ticks) f(i) that increases with the value i.

(b) If a message with a smaller value arrives at x during this time, x will remove i from consideration and process the new value. Otherwise, after holding i for f(i) clock ticks, x will forward it along the ring.

The effect is that a message with value i will effectively travel along the ring at a speed of one link every 1 + f(i) time units: If originally sent at time 0, it will be sent at time 1 + f(i) to the next entity, and again at time 2 + 2f(i), 3 + 3f(i), and so on, until it is trashed or completes the tour of the ring. In this simple way, we have implemented both variable speeds and the “catch-up” of slow messages by faster ones!

The correctness of this new protocol follows from the fact that, again, the message with the smallest id will never be trashed and will thus return to its originator; every
other message will be trashed either because it arrives at an entity that has seen a smaller id or because it is reached by a message with a smaller id.

To determine the cost of the protocol, called Speed, obviously we must take care of several implementation details (variables, bookkeeping, start, speed, etc.), but the basic mechanism is there. Let us assume for the moment that all entities are initially candidates and start at the same time. For every choice of the monotonically increasing speed function f we will obtain a different cost. In particular, by choosing $f(i) = 2^i$, we have a very interesting situation. In fact, by the time (the message with) the smallest id $i_1$ has traveled all along the ring causing n transmissions, the second smallest $i_2$ could have traveled at most halfway around the ring causing n/2 transmissions, the third smallest could have traveled at most n/4, and in general the jth smallest could have traveled at most distance $n/2^{j-1}$. In other words, with this choice of speed function, the total number of transmissions until the entity with smallest value becomes leader is

$$\sum_{j=1}^{n} \frac{n}{2^{j-1}} < 2n.$$
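The strict inequality is just the geometric series with ratio 1/2; spelled out as a short derivation:

```latex
\sum_{j=1}^{n} \frac{n}{2^{j-1}}
  \;=\; n \sum_{j=0}^{n-1} \left(\frac{1}{2}\right)^{j}
  \;=\; n \cdot \frac{1 - 2^{-n}}{1 - \frac{1}{2}}
  \;=\; 2n \left(1 - 2^{-n}\right)
  \;<\; 2n .
```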

As the protocol will just need an additional n messages for the final notification, we have

M[Speed] = O(n).    (6.2)

This result is remarkable: This message complexity is lower than the Ω(n log n) lower bound for leader election in asynchronous rings! It clearly shows a fundamental complexity difference between synchronous and asynchronous systems. To achieve this result, we have used time directly as a computational tool: to implement the variable speeds of the messages and to select the appropriate waiting function f. The result must be further qualified; in fact, it is correct only if the entity values are small enough to fit into a packet, that is, only if the input values are bounded by $2^c$; we will denote this additional restriction on the size of the input by InputSize($2^c$). To have a better understanding of the amount of transmissions, we can measure the number of bits: B[Speed] = O(n log i),

(6.3)

where i is the range of the input values. We have assumed that all entities start at the same time. This assumption is not essential: It sufﬁces that we ﬁrst perform a wake-up, and elect a leader only among


PROTOCOL Speed

States: S = {ASLEEP, CANDIDATE, RELAYER, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.

Restrictions: RI ∪ Synch ∪ Ring ∪ InputSize(2^c).

ASLEEP
    Spontaneously
    begin
        min := id(x);
        send("FindMin", min) to right;
        become CANDIDATE;
    end

    Receiving("FindMin", id)
    begin
        min := id;
        send("FindMin", min) to other;
        become RELAYER;
    end

CANDIDATE
    Receiving("FindMin", id)
    begin
        if id < min then
            PROCESS-MESSAGE;
            become RELAYER
        else
            if id = id(x) then
                send(Notify) to other;
                become LEADER
            endif;
        endif
    end

    When(c(x) = alarm)
    begin
        send("FindMin", min) to direction;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

FIGURE 6.3: Protocol Speed.

the spontaneous initiators (i.e., the others will not originate a message but will still actively participate in the trashing and waiting processes). The election messages themselves can act as “wake-up” messages, traveling at normal (i.e., unitary) speed until they reach the first spontaneous initiator, and only then traveling at the assigned speed. In this way, we still obtain an O(n) message complexity (Exercise 6.6.3).


RELAYER
    Receiving("FindMin", id)
    begin
        if id < min then
            PROCESS-MESSAGE;
        endif
    end

    When(c(x) = alarm)
    begin
        send("FindMin", min) to direction;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

Procedure PROCESS-MESSAGE
begin
    min := id;
    direction := sender;
    set alarm := c(x) + f(id*);
end

FIGURE 6.4: Rule for Relayer and Procedure Process-Message used by protocol Speed.
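The holding mechanism of protocol Speed can be exercised with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not part of the figures: the ring is unidirectional, all entities are candidates and start at tick 0, the final Notify round is omitted, and the names `simulate_speed`, `ARRIVAL`, and `ALARM` are hypothetical. It counts every send, including the n initial ones, so the count is compared with a small constant multiple of n rather than with the idealized distance bound.

```python
import heapq

ARRIVAL, ALARM = 0, 1  # at the same tick, arrivals are handled before alarms


def simulate_speed(ids, f=lambda i: 2 ** i):
    """Simulate the FindMin stage of Speed on a unidirectional ring.

    ids[x] is the distinct nonnegative value of entity x, which sends to
    entity (x + 1) % n. Returns (leader_index, transmissions, elapsed_ticks).
    """
    n = len(ids)
    smallest = list(ids)   # smallest value each entity has seen so far
    transmissions = 0
    events = []            # heap of (tick, kind, entity, value)

    # Tick 0: every entity sends its own value; it arrives one tick later.
    for x, value in enumerate(ids):
        heapq.heappush(events, (1, ARRIVAL, (x + 1) % n, value))
        transmissions += 1

    while events:
        tick, kind, x, value = heapq.heappop(events)
        if kind == ARRIVAL:
            if value == ids[x]:
                # The message completed the tour: x holds the minimum.
                return x, transmissions, tick
            if value < smallest[x]:
                # Hold the new minimum for f(value) ticks before forwarding.
                smallest[x] = value
                heapq.heappush(events, (tick + f(value), ALARM, x, value))
            # Otherwise the message is trashed.
        elif smallest[x] == value:
            # Alarm rings and no smaller value arrived meanwhile: forward.
            heapq.heappush(events, (tick + 1, ARRIVAL, (x + 1) % n, value))
            transmissions += 1
        # A stale alarm (value already overtaken) is ignored.
    raise RuntimeError("no leader elected")


if __name__ == "__main__":
    ring = [5, 3, 7, 1, 6, 2, 8, 4]
    leader, sent, ticks = simulate_speed(ring)
    print(ring[leader], sent, ticks)
```

Replacing `f` with another increasing function changes both the number of transmissions and the elapsed time, which is the trade-off the surrounding text discusses.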

The modified protocol Speed is shown in Figures 6.3 and 6.4; c(x) denotes the local clock of the entity x executing the protocol, and When denotes the external event of the alarm clock ringing.

Beyond the Scenes

The results expressed by Equations 6.2 and 6.3 do not tell the whole story. If we calculate the time consumed by protocol Speed we find (Exercise 6.6.4) that

T[Speed] = $O(n 2^i)$.   (6.4)

In other words, the time is exponential. It is actually worse than it sounds. In fact, it is exponential not in n (a system parameter) but in the range i of the input values.

Overcoming Transmission Costs: 2-bit Communication

We have seen how, in a synchronous environment, the lower bounds established for asynchronous problems do not necessarily hold. This is because of the additional computational power of fully synchronous systems.


FIGURE 6.5: Entity x sends only two packets.

The clearest and (yet) most surprising example of the difference between synchronous and asynchronous environments is the one we will discuss now. Consider an entity x that wants to communicate to a neighbor y some information, unknown to y. Recall that in a fully synchronous system messages are bounded: If I want to transmit w bits, I will have to send $\lceil w/c \rceil$ packets and therefore use at least $\lceil w/c \rceil$ time units or clock ticks. Still, x can communicate the information to y transmitting only two packets (!), regardless of the packet size (!!) and regardless of the information (!!!), provided it is finite.

Property 6.1.2 In absence of failures, any finite sequence of bits can be communicated transmitting two messages, regardless of the message size.

Let us see how this extraordinary result is possible. Let α be the sequence of bits that x wants to communicate to y; let 1α be the sequence α prefixed by the bit 1 (e.g., if α = 011, then 1α = 1011). Let I(1α) denote the integer whose binary encoding is 1α; for example, I(1011) = 11. Consider now the following protocol:

PROTOCOL TwoBits.
1. Entity x (see Figure 6.5): (a) it sends to y a message “Start-Counting”; (b) it waits for I(1α) clock ticks, and then (c) sends a message “Stop-Counting”.
2. Entity y (see Figure 6.6): (a) upon receiving the “Start-Counting” message, it records the current value c_1 of the local clock; (b) upon receiving the “Stop-Counting” message, it records the current value c_2 of the local clock. Clearly c_2 − c_1 = I(1α), from which α can be reconstructed.

As the message size is irrelevant and the string 1α is finite but arbitrary, the property states that in absence of failures, any finite amount of information can be communicated by transmitting just 2 bits!
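The two sides of TwoBits can be sketched in a few lines. This is a minimal illustration, not the book's code: the function names are hypothetical, the bit sequence α is represented as a Python string of '0' and '1' characters, and the one-tick transmission delay is modeled by adding 1 to both send times.

```python
def integer_of(alpha: str) -> int:
    """I(1α): the integer whose binary encoding is α prefixed by the bit 1."""
    return int("1" + alpha, 2)


def sender_ticks(alpha: str, start: int = 0) -> tuple[int, int]:
    """Ticks at which x sends Start-Counting and Stop-Counting."""
    return start, start + integer_of(alpha)


def receiver_decode(c1: int, c2: int) -> str:
    """Recover α from the two local clock readings of y."""
    # bin() yields '0b1...'; drop the '0b' prefix and the leading marker bit.
    return bin(c2 - c1)[3:]


if __name__ == "__main__":
    start, stop = sender_ticks("011", start=7)
    # Each message arrives one tick after it is sent.
    print(receiver_decode(start + 1, stop + 1))
```

The leading 1 is what preserves leading zeros of α: without it, the sequences 011 and 11 would produce the same waiting time.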


FIGURE 6.6: Entity y can reconstruct the information.

IMPORTANT. In synchronous computing there is a difference between communication and transmission. In fact, unlike asynchronous systems where transmission of messages is the only way in which neighboring entities can communicate, in synchronous systems absence of transmission can be used to communicate information, as we have just seen. In other words, in synchronous systems silence is expressive. This is the radical difference between synchronous and asynchronous computing environments. We will investigate how to exploit it in our designs.

Beyond the Scenes

The property, as stated, is incomplete from a complexity point of view. In fact, in a synchronous system, time and transmission complexities are intrinsically related to a degree nonexistent in asynchronous systems. In the example above, the constant bit complexity is achieved at the cost of a time complexity that is exponential in the length of the sequence of bits to be communicated. In fact, x has to wait I(1α) time units, but $2^{|\alpha|} \le I(1\alpha) \le 2^{|\alpha|+1} - 1$, where |α| denotes the size (i.e., the number of bits) of α. Once again, there is an exponential time cost to be paid for the remarkable use of time.

6.1.5 The Cost of Synchronous Protocols

In a fully synchronous system, time and transmission complexities are intrinsically related to a degree nonexistent in asynchronous systems. As we have discussed in the subsection “Beyond the Scenes” of Section 6.1.4, to say “we can solve the election in a ring with O(n) messages” or “we can communicate the Encyclopædia Britannica transmitting 2 bits” is correct but incomplete. We have been able to achieve those results because we have used time as a computational element; however, time must be charged, and the protocol must pay for it.


In other words, the cost of a fully synchronous protocol is both time and transmissions. More precisely, the communication cost of a fully synchronous protocol P is a couple ⟨P, T⟩, where P denotes the number of packets and T denotes the number of time units. We will more often use the number of bits B instead of P; thus, our common measure will be the couple Cost[P] = ⟨B[P], T[P]⟩. So, for example, the complexity of Protocol Speed is Cost[Speed(i)] = ⟨O(n log i), $O(n 2^i)$⟩ and that of Protocol TwoBits is Cost[TwoBits(α)] = ⟨2, $O(2^{|\alpha|})$⟩. Summarizing, the cost of a fully synchronous protocol is both time and bits. In general, we can trade off one for the other, transmitting more bits to use less time, or vice versa, depending on our design goals.

6.2 COMMUNICATORS, PIPELINE, AND TRANSFORMERS

In a system of communicating entities, the most basic and fundamental problem is obviously the process of an entity, the sender, efficiently and accurately communicating information to another entity, the receiver. If these two entities are neighbors, this problem is called the Two-Party Communication (TPC) problem. In an asynchronous system, this problem has only one solution: The sender puts the information into messages and transmits those messages. In fully synchronous systems, as we have already observed, transmission of bits is not the only way of communicating information; for example, in a fault-free system, if no bit is received at local time t + 1, then none was transmitted at time t. Hence, absence of transmission, or silence, is detectable and can be used to convey information. In fact, there are many possible solutions to the Two-Party Communication problem, called communicators, each with different costs. We have already seen one, Protocol TwoBits. In this section we will examine the design of efficient communicators. Owing to the basic nature of the process, the choice of a communicator will greatly affect the overall performance of the higher level protocols employed in the system.

We will then discuss the problem of communicating information at a distance, that is, when the sender and the receiver are not neighbors. We will see how this and related problems can be efficiently solved using a technique well known in very large scale integration (VLSI) and parallel systems: pipeline. We will also examine the notion of asynchronous-to-synchronous transformer, a “compiler” that, given in input an asynchronous protocol solving a problem P,


FIGURE 6.7: For the sender, a quantum is the number of clock ticks between two successive transmissions; for the receiver, it is the interval between two successive arrivals.

generates an efficient synchronous protocol solving P. Such a transformer is a useful tool to solve problems for which an asynchronous solution is already known. Communicators are an essential component of a transformer; in fact, as we will see, different communicators result in different costs for the generated synchronous protocol. This is one more reason to focus on the design of efficient communicators. In the following, we will assume that no failure will occur, that is, we operate under restriction Total Reliability.

6.2.1 Two-Party Communication

Consider the simple task of an entity, the sender, communicating information to a neighbor, the receiver. At each time unit, the sender can either transmit a packet or remain silent; a packet transmitted by the sender at time t will be received and processed by the receiver at time t + 1 (sender’s time). The interval of time between two successive transmissions by the sender is called a quantum of silence (or, simply, quantum); if there are no failures, the interval of time between the two arrivals will be the same for the receiver (see Figure 6.7). The quantum is zero if the packets are sent at two consecutive clock ticks. Thus, to communicate information, the sender can use not only the transmission of several packets, but also the quanta of silence between successive transmissions. For example, in the TwoBits protocol, the sender was using the transmission of two packets as well as the quantum of silence between them. In general, the transmission of k packets p_0, p_1, ..., p_{k−1} defines k − 1 quanta q_1, q_2, ..., q_{k−1}, where q_i is the interval between the transmissions of p_{i−1} and p_i, 1 ≤ i ≤ k − 1. The ordered sequence p_0 : q_1 : p_1 : ... : q_{k−1} : p_{k−1} will be called a communication sequence.
Clearly, there are many different ways in which we can design a protocol for the two entities to communicate using transmissions and silence, depending on the value of k we choose, the content of the packets, the size c of the packets, and so forth. Each design will yield a different cost.


The problem of performing this task is called the Two-Party Communication problem, and any solution protocol is called a communicator. A communicator must specify the operations of the sender and of the receiver. In particular, a communicator is composed of an encoding function, specifying how to encode the information into the communication sequence of packets and silence, and a decoding function, specifying how to reconstruct the information from the communication sequence of packets and silence. Associated with any communicator are clearly two related cost measures: the total number of packets transmitted and the total number of clock ticks elapsed during the communication; as we will see, the study of the two-party communication problem in synchronous networks is really the study of the trade-off between time and transmissions.

IMPORTANT. To simplify the discussion, in the following, we will consider that a packet contains just a single bit, that is, c = 1. Everything we will say is easily extendable to the case c > 1.

2-bit Communicators

We have already seen the most well known communicator, Protocol TwoBits. This protocol, also known as C_2, belongs to a class of communicators called k-bit Communicators where the number of transmitted packets is a constant k fixed a priori and known to both entities. In C_2, to communicate a positive integer i, the sender transmits two packets, b_0 and b_1, waiting i time units between the two transmissions; the receiver computes the quantum of silence q_1 between the two transmissions and decodes it as the information. In other words, the communication pattern is b_0 : q_1 : b_1. The encoding function is encode(i) = b_0 : i : b_1 and the decoding function is decode(b_0 : q_1 : b_1) = q_1. Thus, the total amount of time from the time the sender starts the first transmission to the time the receiver decodes the information is the quantum of silence plus the two time units used for transmitting the bits.
Thus, the cost of the protocol is

Cost[C_2(i)] = ⟨2, i + 2⟩.   (6.5)


Hacking. We can improve the time complexity by exploiting the fact that the two transmitted bits b_0 and b_1 can be used to convey some information about i. In fact, it is possible to construct a communicator, called R_2, that communicates i transmitting 2 bits and only 2 + i/4 time units (Exercise 6.6.6). Clearly, a better time complexity will be obtained if packets contain more than a single bit; that is, c > 1 (Exercise 6.6.7).

3-bit Communicators

Let us examine what difference transmitting an extra packet has on the overall cost of communication. First of all, observe that with three packets b_0, b_1, and b_2, we have two quanta of silence: the interval of time q_1 between the transmission of b_0 and b_1 and the interval q_2 between the transmission of b_1 and b_2. In other words, the communication pattern is b_0 : q_1 : b_1 : q_2 : b_2.

With this extra quantum at our disposal, consider the following strategy. If the sender could communicate $\sqrt{i}$ using a single quantum, the receiver could reconstruct i by squaring the received quantum, and the entire process would still cost 2 bits (to delimit the quantum) but only $\sqrt{i} + 2$ time! The problem with this strategy is that $\sqrt{i}$ might not be an integer, while a quantum must be an integer. The sender can obviously use a quantum $q_1 = \lfloor \sqrt{i} \rfloor$, which is an integer, and the receiver can compute $q_1^2$, which, however, might be smaller than i. What the sender can do is to use the second quantum q_2 to communicate how far $q_1^2$ is from i, that is, $q_2 = i - q_1^2$. In this way, the receiver is able to reconstruct i: It simply computes $q_1^2 + q_2$. In other words, the encoding function is

$$\text{encode}(i) = b_0 : \lfloor \sqrt{i} \rfloor : b_1 : i - \lfloor \sqrt{i} \rfloor^2 : b_2.$$

For example, encode(8,425) = b_0 : 91 : b_1 : 144 : b_2. The decoding function is

$$\text{decode}(b_0 : q_1 : b_1 : q_2 : b_2) = q_1^2 + q_2.$$

The time required by this protocol is clearly q_1 + q_2 + 3; as $x - \lfloor \sqrt{x} \rfloor^2 \le 2 \lfloor \sqrt{x} \rfloor$, we have

$$q_1 + q_2 + 3 = \lfloor \sqrt{i} \rfloor + i - \lfloor \sqrt{i} \rfloor^2 + 3 \le 3 \lfloor \sqrt{i} \rfloor + 3.$$

In other words, this protocol, called C_3, has sublinear time complexity. The resulting cost is

$$\text{Cost}[C_3(i)] = \langle 3,\ 3 \lfloor \sqrt{i} \rfloor + 3 \rangle.$$

(6.6)
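The two quanta used by C_3 can be computed directly with an integer square root. The following sketch is illustrative only; the function names are hypothetical, and the packets b_0, b_1, b_2 are left implicit because their content is irrelevant to the decoding.

```python
from math import isqrt


def c3_encode(i: int) -> tuple[int, int]:
    """Return the quanta (q1, q2) with q1 = floor(sqrt(i)) and q2 = i - q1^2."""
    q1 = isqrt(i)
    return q1, i - q1 * q1


def c3_decode(q1: int, q2: int) -> int:
    """Reconstruct i from the two observed quanta."""
    return q1 * q1 + q2


def c3_time(i: int) -> int:
    """Clock ticks used: the two quanta plus three ticks for the packets."""
    q1, q2 = c3_encode(i)
    return q1 + q2 + 3


if __name__ == "__main__":
    print(c3_encode(8425), c3_time(8425))
```

`math.isqrt` is used instead of a floating-point square root so that the quantum is exact for arbitrarily large integers.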


FIGURE 6.8: Constructing the encoding of 33,703 when k = 5.

Hacking. We can improve the time complexity by exploiting the fact that the transmitted packets can be used to convey some information about i. In fact, it is possible to construct a communicator, called R_3, that communicates i transmitting 3 bits and only $\sqrt{i} + 3$ time units (Exercise 6.6.8). Again, the more bits a packet contains, the better will be the time costs (Exercise 6.6.9).

(2^d + 1)-bit Communicators

A solution protocol using $k = 2^d + 1$ bits can be easily obtained extending the idea employed for $k = 2^1 + 1 = 3$. The encoding of i can be defined recursively as follows:

$$\text{encoding}(i) = b : E(I_1) : b$$

$$E(I_j) = \begin{cases} E(I_{2j}) : b : E(I_{2j+1}) & \text{if } 1 \le j < k - 1 \\ \text{quantum of length } I_j & \text{if } k - 1 \le j \le 2k - 3, \end{cases}$$

where $I_1 = i$, $I_{2j} = \lfloor \sqrt{I_j} \rfloor$, and $I_{2j+1} = I_j - I_{2j}^2$, and b is an arbitrary packet. So, for example, the encoding of i = 33,703 when k = 5 is b 13 b 14 b 14 b 18 b (see Figure 6.8). To obtain $i = I_1$, the receiver will recursively compute $I_j = I_{2j}^2 + I_{2j+1}$. Exactly k − 1 quanta will be used, and k bits will be transmitted. The time costs will be $O(i^{1/k})$ (Exercise 6.6.10).

Optimal (k+1)-bit Communicators

When designing efficient communicators, several questions arise naturally: How good are the communicators we have designed so far? In general, if we use k + 1 transmissions, what is the best time that can be achieved and which communicator will be able to achieve it? In this section we will answer these questions. We will design a general class of solution protocols and analyze their cost; we will then establish lower bounds and show that the proposed protocols achieve these bounds and are therefore optimal.
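Before moving on, the recursive (2^d + 1)-bit encoding described above can be checked against the example of Figure 6.8. The sketch below is an assumption-laden illustration: the function names are hypothetical, the parameter `d` is the recursion depth (so that 2^d quanta are produced), and only the quanta are returned because the packets carry no information.

```python
from math import isqrt


def encode_quanta(i: int, d: int) -> list[int]:
    """Split i recursively into 2**d quanta (floor square root, remainder)."""
    if d == 0:
        return [i]
    root = isqrt(i)
    # Left subtree encodes the root, right subtree encodes the remainder.
    return encode_quanta(root, d - 1) + encode_quanta(i - root * root, d - 1)


def decode_quanta(quanta: list[int]) -> int:
    """Invert encode_quanta: square the left half's value, add the right's."""
    if len(quanta) == 1:
        return quanta[0]
    half = len(quanta) // 2
    return decode_quanta(quanta[:half]) ** 2 + decode_quanta(quanta[half:])


if __name__ == "__main__":
    q = encode_quanta(33703, 2)   # k = 2**2 + 1 = 5 packets, 4 quanta
    print(q, decode_quanta(q))
```

With `d = 1` the function reduces to the two quanta of the 3-bit communicator C_3.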


Our goal is now to design protocols that can communicate any positive integer I transmitting k + 1 packets and using as little time as possible. Observe that with k + 1 packets the communication sequence is b_0 : q_1 : b_1 : q_2 : b_2 : ... : q_k : b_k.

We will first of all make a distinction between protocols that do not care about the content of the transmitted packets (like C_2 and C_3) and those (like R_2 and R_3) that use those packets to convey information about I. The first class of protocols is able to tolerate the type of transmission failures called corruptions. In fact, they use packets only to delimit quanta; as it does not matter what the content of the packet is (but only that it is being transmitted), these protocols will work correctly even if the value of the bits in the packets is changed during transmission. We will call them corruption-tolerant communicators. The second class exploits the content of the packets to convey information about I; hence, if the value of just one of the bits is changed during transmission, the entire communication will become corrupted. In other words, these communicators need reliable transmission for their correctness. Clearly, the bounds and the optimal solution protocols are different for the two classes. We will consider the first class in detail; the second type of communicators will be briefly sketched at the end. As before, we will consider for simplicity the case when a packet is composed of a single bit, that is, c = 1; the results can be easily generalized to the case c > 1.

Corruption-Tolerant Communication

If transmissions are subject to corruptions, the value of the received packets cannot be relied upon, and so they are used only to delimit quanta. Hence, the only meaningful part of the communication sequence is the k-tuple of quanta ⟨q_1, q_2, ..., q_k⟩. Thus, the (infinite) set Q_k of all possible k-tuples ⟨q_1, q_2, ..., q_k⟩, where the q_i are nonnegative integers, describes all the possible communication sequences. What we are going to do is to associate to each communication sequence Q[I] ∈ Q_k a different integer I. Then, if we want to communicate I, we will use the unique sequence of quanta described by Q[I]. To achieve this goal we need a bijection between k-tuples and nonnegative integers. This is not difficult to do; it is sufficient to establish a total order among tuples as follows. Given two k-tuples Q = ⟨q_1, q_2, ..., q_k⟩ and Q′ = ⟨q′_1, q′_2, ..., q′_k⟩ of nonnegative integers, we say that Q < Q′ if

1. $\sum_i q_i < \sum_i q'_i$, or
2. $\sum_i q_i = \sum_i q'_i$ and $q_j = q'_j$ for 1 ≤ j < l, and $q_l < q'_l$ for some index l, 1 ≤ l ≤ k.


I    Q[I]      I    Q[I]      I    Q[I]
0    0,0,0     11   0,1,2     23   0,3,1
1    0,0,1     12   0,2,1     24   0,4,0
2    0,1,0     13   0,3,0     25   1,0,3
3    1,0,0     14   1,0,2     26   1,1,2
4    0,0,2     15   1,1,1     27   1,2,1
5    0,1,1     16   1,2,0     28   1,3,0
6    0,2,0     17   2,0,1     29   2,0,2
7    1,0,1     18   2,1,0     30   2,1,1
8    1,1,0     19   3,0,0     31   2,2,0
9    2,0,0     20   0,0,4     32   3,0,1
10   0,0,3     21   0,1,3     33   3,1,0
               22   0,2,2     34   4,0,0

FIGURE 6.9: The first 35 elements of Q_3 according to the total order.

That is, in this total order, all the tuples where the sum of the quanta is t are smaller than those where the sum is t + 1; so, for example, ⟨2, 0, 0⟩ is smaller than ⟨1, 1, 1⟩. If the sum of the quanta is the same, the tuples are lexicographically ordered; so, for example, ⟨1, 0, 2⟩ is smaller than ⟨1, 1, 1⟩. The ordered list of the first few elements of Q_3 is shown in Figure 6.9. In this way, if we want to communicate integer I we will use the k-tuple Q whose rank (starting from 0) in this total order is I. So, for example, in Q_3, the triple ⟨1, 0, 3⟩ has rank 25, and the triple ⟨0, 1, 4⟩ corresponds to integer 36. The solution protocol, which we will call Order_k, thus uses the following encoding and decoding schemes.

Protocol Order_k

Encoding Scheme: Given I, the sender
(E1) finds Q_k[I] = ⟨a_1, a_2, ..., a_k⟩;
(E2) sets encoding(I) := b_0 : a_1 : b_1 : ... : a_k : b_k, where the b_i are bits of arbitrary value.

Decoding Scheme: Given (b_0 : a_1 : b_1 : ... : a_k : b_k), the receiver
(D1) extracts Q = ⟨a_1, a_2, ..., a_k⟩;
(D2) finds I such that Q_k[I] = Q;
(D3) sets decoding(b_0 : a_1 : b_1 : ... : a_k : b_k) := I.

The correctness of the protocol derives from the fact that the mapping we are using is a bijection. Let us examine the cost of protocol Order_k. The number of bits is clearly k + 1:

B[Order_k](I) = k + 1.   (6.7)
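Steps (E1) and (D2) require converting between an integer and its k-tuple in the total order. One possible way to do this without enumerating the whole list is to count tuples with binomial coefficients, as in the sketch below. The function names `count_exact`, `unrank`, and `rank` are hypothetical, and the approach is only one way of realizing the bijection; ranks start from 0 as in Figure 6.9.

```python
from math import comb


def count_exact(total: int, parts: int) -> int:
    """Number of `parts`-tuples of nonnegative integers summing to `total`."""
    if parts == 0:
        return 1 if total == 0 else 0
    return comb(total + parts - 1, parts - 1)


def unrank(I: int, k: int) -> tuple[int, ...]:
    """Step (E1): the k-tuple of quanta whose rank is I."""
    t = 0
    while comb(t + k, k) <= I:      # tuples with sum <= t have ranks below comb(t+k, k)
        t += 1
    offset = I - (comb(t - 1 + k, k) if t > 0 else 0)
    quanta, remaining = [], t
    for pos in range(k):
        left = k - pos - 1          # positions still to fill after this one
        q = 0
        while offset >= count_exact(remaining - q, left):
            offset -= count_exact(remaining - q, left)
            q += 1
        quanta.append(q)
        remaining -= q
    return tuple(quanta)


def rank(quanta: tuple[int, ...]) -> int:
    """Step (D2): the rank of a k-tuple of quanta."""
    k, t = len(quanta), sum(quanta)
    r = comb(t - 1 + k, k) if t > 0 else 0
    remaining = t
    for pos, q in enumerate(quanta):
        left = k - pos - 1
        for smaller in range(q):    # tuples with a smaller entry here come first
            r += count_exact(remaining - smaller, left)
        remaining -= q
    return r


if __name__ == "__main__":
    Q = unrank(25, 3)
    print(Q, rank(Q), sum(Q) + 3 + 1)   # tuple, rank, ticks used (quanta + k + 1)
```

A brute-force enumeration of the order would work equally well for small values; the counting version simply avoids generating all smaller tuples.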

What is the time? The communication sequence b_0 : q_1 : b_1 : q_2 : b_2 : ... : q_k : b_k costs k + 1 time units spent to transmit the bits b_0, ..., b_k, plus $\sum_{i=1}^{k} q_i$ time units of silence. Hence, to determine the time T[Order_k](I) we need to know the sum of the quanta in Q_k[I]. Let f(I, k) be the smallest integer t such that $I \le \binom{t+k}{k}$. Then (Exercise 6.6.12),

T[Order_k](I) = f(I, k) + k + 1.   (6.8)

Optimality

We are now going to show that protocol Order_k is optimal in the worst case. We will do so by establishing a lower bound on the amount of time required to solve the two-party communication problem using exactly k + 1 bit transmissions. Observe that k + 1 time units will be required by any solution algorithm to transmit the k + 1 bits; hence, the concern is on the amount of additional time required by the protocol. We will establish the lower bound assuming that the values I we want to transmit are from a finite set U of integers. This assumption makes the lower bound stronger because for infinite sets, the bounds can only be worse. Without any loss of generality, we can assume that U = Z_w = {0, 1, ..., w − 1}, where |U| = w. Let c(w, k) denote the number of additional time units needed in the worst case to solve the two-party communication problem for Z_w with k + 1 bits that can be corrupted during the communication. To derive a bound on c(w, k), we will consider the dual problem of determining the size ω(t, k) of the largest set for which the two-party communication problem can always be solved using k + 1 corruptible transmissions and at most t additional time units. Notice that with k + 1 bit transmissions, it is only possible to distinguish k quanta; hence, the dual problem can be rephrased as follows: Determine the largest positive integer w = ω(t, k) such that every x ∈ Z_w can be communicated using k distinguished quanta whose total sum is at most t. This problem has an exact solution (Exercise 6.6.14):

$$\omega(t, k) = \binom{t+k}{k}.$$

(6.9)

This means that if U has size ω(t, k), then t additional time units are needed (in the worst case) by any communicator that uses k + 1 unreliable bits to communicate values of U . If the size of U is not precisely ω(t, k), we can still determine a bound. Let f (w, k) be the smallest integer t such that ω(t, k) ≥ w. Then c(w, k) = f (w, k).

(6.10)
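For small parameters, Equation 6.9 and the function f(w, k) can be checked numerically. The sketch below is a brute-force illustration with hypothetical function names; `count_tuples` simply enumerates every k-tuple of nonnegative quanta with total at most t, so it is only practical for tiny t and k.

```python
from itertools import product
from math import comb


def omega(t: int, k: int) -> int:
    """Closed form of Equation 6.9."""
    return comb(t + k, k)


def count_tuples(t: int, k: int) -> int:
    """Brute-force count of k-tuples of nonnegative integers with sum <= t."""
    return sum(1 for q in product(range(t + 1), repeat=k) if sum(q) <= t)


def f(w: int, k: int) -> int:
    """Smallest t such that omega(t, k) >= w."""
    t = 0
    while omega(t, k) < w:
        t += 1
    return t


if __name__ == "__main__":
    print(omega(4, 3), count_tuples(4, 3), f(35, 3), f(36, 3))
```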


That is,

Theorem 6.2.1 Any corruption-tolerant solution protocol using k + 1 bits to communicate values from Z_w requires f(w, k) + k + 1 time units in the worst case.

In conjunction with Equation 6.8, this means that protocol Order_k is worst-case optimal. We can actually establish a lower bound on the average case as well (Exercise 6.6.15), and prove (Exercise 6.6.16) that protocol Order_k is average-case optimal.

Corruption-Free Communication

If bit transmissions are error free, the value of a received packet can be trusted. Hence it can be used to convey information about the value I the sender wants to communicate to the receiver. In this case, the entire communication sequence, bits and quanta, is meaningful. What we do is something similar to what we just did in the case of corruptible bits. We establish a total order on the set W_k of the (2k + 1)-tuples ⟨b_0, q_1, b_1, q_2, b_2, ..., q_k, b_k⟩ corresponding to all the possible communication sequences. In this way, each (2k + 1)-tuple W[i] ∈ W_k has associated a distinct integer: its rank i. Then, if we want to communicate I, we will use the communication sequence described by W[I]. In the total order we choose, all the tuples where the sum of the quanta is t are smaller than those where the sum is t + 1; so, for example, in W_2, ⟨1, 2, 1, 0, 1⟩ is smaller than ⟨0, 0, 0, 3, 0⟩. If the sum of the quanta is the same, tuples (bits and quanta) are lexicographically ordered; so, for example, in W_2, ⟨1, 1, 1, 1, 1⟩ is smaller than ⟨1, 2, 0, 0, 0⟩. The resulting protocol is called Order+_k. Let us examine its costs. The number of bits is clearly k + 1. Let g(I, k) be the smallest integer t such that $I \le 2^{k+1} \binom{t+k}{k}$. Then (Exercise 6.6.13),

B[Order+_k](I) = k + 1   (6.11)

T[Order+_k](I) = g(I, k) + k + 1.   (6.12)

Also, protocol Order+_k is worst-case and average-case optimal (see Exercises 6.6.17, 6.6.18, and 6.6.19).

Other Communicators

The protocols Order_k and Order+_k belong to the class of (k + 1)-bit communicators where the number of transmitted bits is fixed a priori and known to both the entities. In this section, we consider arbitrary communicators, where the number of bits used in the transmission might not be predetermined (e.g., it may change depending on the value I being transmitted).


With arbitrary communicators, the basic problem is obviously how the receiver can decide when a communication has ended. This can be achieved in many different ways, and several mechanisms are possible. Following are two classical ones:

Bit Pattern. The sender uses a special pattern of bits to notify the end of communication. For example, the sender sets all bits to 0, except the last, which is set to 1; the drawback with this approach is that the bits cannot be used to convey information about I.

Size Communication. As part of the communication, the sender communicates the total number of bits it will use. For example, the sender uses the first quantum to communicate the number of bits it will use in this communication; the drawback of this approach is that the first quantum cannot be used to convey information about I.

We now show that, however ingenious the employed mechanism may be, the results are not much better than those obtained just using optimal (k + 1)-bit communicators. In fact, an arbitrary communicator can only improve the worst-case complexity by an additive constant. This is true even if the receiver has access to an oracle revealing (at no cost) for each transmission the number of bits the sender will use in that transmission. Consider first the case of corruptible transmissions. Let γ(t, b) denote the size of the largest set for which an oracle-based communicator uses at most b corruptible bits and at most t + b time units.

Theorem 6.2.2 γ(t, b) < ω(t + 1, b)

Proof. As up to k + 1 corruptible bits can be transmitted, by Equation 6.9,

$$\gamma(t, b) = \sum_{j=1}^{k} \omega(t, j) = \sum_{j=1}^{k} \binom{t+j}{j} = \binom{t+k+1}{k} - 1 < \binom{t+1+k}{k} = \omega(t+1, b).$$ ∎

This implies that, in the worst case, communicator Order_k requires at most one time unit more than any strategy of any type which uses the same maximum number of corruptible bits. Consider now the case of incorruptible transmissions. Let α(t, b) denote the size of the largest set for which an oracle-based communicator uses at most b reliable bits and at most t + b time units. To determine a bound on α(t, b), we will first consider the size β(t, k) of the largest set for which a communicator without an oracle uses always at most b reliable bits and at most t + b time units. We know (Exercise 6.6.17) that

Lemma 6.2.1 $\beta(t, k) = 2^{k+1} \binom{t+k}{k}$.

From this, we can now derive

Theorem 6.2.3 α(t, b) < β(t + 1, b).


Proof. As up to k + 1 incorruptible bits can be transmitted,

$$\alpha(t, b) = \sum_{j=1}^{k} \beta(t, j) = \sum_{j=1}^{k} 2^{j+1} \binom{t+j}{j} < 2^{k+1} \binom{t+1+k}{k}$$

[...]

I_2, x_2 will finish waiting its value before this message arrives. In this case, x_2 will wait until it receives the “Stop-Counting” signal from x_1, and then forward it. Thus, the “Stop-Counting” signal will be sent to x_3 at the correct time t + 1 + I_1 = t + 1 + Max{I_1, I_2} = t . That is, x_2 will always send Max{I_1, I_2} in time to x_3. The same reasoning we just used to understand how x_2 can know Max{I_1, I_2} in time can be applied to verify that indeed each x_j can know Max{I_1, I_2, ..., I_{j−1}} in time (Exercise 6.6.23). An example is shown in Figure 6.12. We have described the solution using TwoBits as the communicator. Clearly any communicator C can be used, provided that its encoding is monotonically increasing,

FIGURE 6.12: Time–Event diagram showing the computation of the largest value in pipeline.


that is, if I > J, then in C the communication sequence for I is lexicographically smaller than that for J. Note that protocols Order_k and Order+_k are not monotonically increasing; however, it is not difficult to redefine them so that they have such a property (Exercises 6.6.21 and 6.6.22). The total number of bits will then be

(p − 1) Bits(C, I_max),   (6.15)

the same as that without pipeline. The time instead is at most

(p − 1) + Time(C, I_max).   (6.16)

Once again, the number of bits is the same as that without pipeline; the time costs are instead greatly reduced: The factor (p − 1) is additive and not multiplicative. Similar reductions in time can be obtained for other computations, such as computing the minimum value (Exercise 6.6.24), the sum of the values (Exercise 6.6.25), and so forth. The approach we used for these computations in a chain can be generalized to arbitrary tree networks; see for example Problems 6.6.5 and 6.6.6.

6.2.3 Transformers

Asynchronous-to-Synchronous Transformation

The task of designing a fully synchronous solution for a problem can be easily accomplished if there is already a known asynchronous solution A for that problem. In fact, since A makes no assumptions on time, it will run under every timing condition, including the fully synchronous ones. Its cost in such a setting would be the number of messages M(A) and the “ideal” time T(A). Note that this presupposes that the size m(A) of the messages used by A is not greater than the packet size c (otherwise, the message must be broken into several packets, with a corresponding increase in message and time complexity). We can actually exploit the availability of an asynchronous solution protocol A in a more clever way and with a more efficient performance than just running A in the fully synchronous system. In fact, it is possible to transform any asynchronous protocol A into an efficient synchronous one S, and this transformation can be done automatically. This is achieved by an asynchronous-to-synchronous transformer (or just transformer), a “compiler” that, given in input an asynchronous protocol solving a problem P, generates an efficient synchronous protocol solving P. The essential component of a transformer is the communicator. Let C be a universal communicator (i.e., a communicator that works for all positive integers). An asynchronous-to-synchronous transformer τ[C] is obtained as follows.
Transformer τ [C] Given any asynchronous protocol A, replace the asynchronous transmission-reception of each message in A by the communication, using C, of the information contained in that message.


In other words, we replace each “send message” instruction in algorithm A by an instruction “communicate content of message,” where the communication is performed using the communicator C. It is not difficult to verify that if A solves problem P for a class $\mathcal{G}$ of system topologies (i.e., graphs), then τ[C](A) = S is a fully synchronous protocol that solves P for the graphs in $\mathcal{G}$. Note that in a practical implementation, we must take care of several details (e.g., overlapping arrival of messages) that we are not discussing here.

Let us now calculate the cost of the obtained protocol S = τ[C](A) in a graph $G \in \mathcal{G}$; let $M(A)$, $T_{causal}(A)$, and $m(A)$ denote the message complexity, the causal time complexity, and the size of the largest message, respectively, of A in G. Recall that the causal time complexity is the length of the longest chain of causally related message transmissions over all possible executions. For some protocols, it might be difficult to determine the causal time; however, we know that $T_{causal}(A) \le M(A)$; hence we always have an upper bound.

In the transformation, the transmission (and corresponding reception) of I in A is replaced by the communication of I using communicator C; this communication requires Time(C, I) time and Packets(C, I) packets. As at most $T_{causal}(A)$ messages must be sent sequentially (i.e., one after the other) and $I \le 2^{m(A)}$ (a message of at most m(A) bits encodes a value no larger than $2^{m(A)}$), the total number of clock ticks required by S will be

$$\mathrm{Time}(S) \le T_{causal}(A) \times \mathrm{Time}(C, 2^{m(A)}). \tag{6.17}$$

As the information of each of the M(A) messages must be communicated, and the messages have size at most m(A), the total number of packets P(S) transmitted by the synchronous protocol S is just

$$P(S) \le M(A) \times \mathrm{Packets}(C, m(A)). \tag{6.18}$$

In other words,

Lemma 6.2.2 (Transformation Lemma) For every universal communicator C there exists an asynchronous-to-synchronous transformer τ[C]. Furthermore, for every asynchronous protocol A, the packet-time cost of τ[C](A) is at most

$$\mathrm{Cost}[\tau[C](A)] \le \langle\, M(A)\,\mathrm{Packets}(C, m(A)),\; T_{causal}(A)\,\mathrm{Time}(C, 2^{m(A)}) \,\rangle.$$

This simple transformation mechanism might appear to yield inefficient solutions for the synchronous case. To dispel this false appearance, we will consider an interesting application.

Application: Election in a Synchronous Ring

Consider the problem of electing a leader in a synchronous ring. We assume the standard restrictions for elections (IR), as well as Synch. We have seen several efficient election algorithms for asynchronous ring networks in previous chapters. Let us choose one and examine the effects of the transformer.


Consider protocol Stages. Recall that this protocol uses $M(Stages) = 2n \log n + O(n)$ messages; each message contains a value; hence, $m(Stages) = \log i$, where i is the range of the input values. Regarding the causal time, as $T_{causal}(A) \le M(A)$ for every protocol A, we have $T_{causal}(Stages) \le 2n \log n + O(n)$.

To apply the Transformation Lemma, we need to choose a universal communicator. Let us choose a not very efficient one: TwoBits; recall that the cost of communicating integer I is 2 bits and I + 2 time units.

Let us now apply the Transformation Lemma. We then have a new election protocol SynchStages = τ[TwoBits](Stages) for synchronous rings; as $\mathrm{Time}(TwoBits, 2^{m(Stages)}) = 2^{\log i} + 2 = i + 2$, by Lemma 6.2.2, we have

$$T(SynchStages) \le 2n \log(n)\,(i + 2) + \text{l.o.t.} \tag{6.19}$$

and

$$B(SynchStages) = 2\,M(Stages) \le 4n \log(n) + O(n). \tag{6.20}$$
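The bounds (6.17) and (6.18) are simple enough to evaluate numerically. The following Python sketch is only an illustration: the function names are hypothetical, the TwoBits costs are the ones recalled in the text (2 bits and I + 2 time units), and only the leading term 2n log n of M(Stages) is used, so the O(n) term is ignored.

```python
import math

def two_bits_time(value: int) -> int:
    """Time units used by TwoBits to communicate the integer `value` (I + 2)."""
    return value + 2

def two_bits_packets(value: int) -> int:
    """TwoBits always transmits 2 bits, whatever the value."""
    return 2

def transformed_cost(messages: int, causal_time: int, max_msg_bits: int):
    """Upper bounds (packets, time) of (6.18) and (6.17) for tau[TwoBits](A)."""
    largest_value = 2 ** max_msg_bits            # I <= 2^m(A)
    packet_bound = messages * two_bits_packets(largest_value)
    time_bound = causal_time * two_bits_time(largest_value)
    return packet_bound, time_bound

def stages_messages(n: int) -> int:
    """Leading term 2 n log n of M(Stages); lower-order terms are ignored."""
    return math.ceil(2 * n * math.log2(n))

# Assumed sample instance: ring of n = 16 entities, values of log i = 8 bits.
m = stages_messages(16)                          # 128 messages
bits, time_units = transformed_cost(m, m, 8)     # uses T_causal <= M
```

For this sample instance the sketch gives 256 bits and 128 × (256 + 2) = 33024 time units, which matches the shape of the bounds: the bits do not depend on the range i of the values, while the time grows linearly with it.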

This result must be compared with the bounds of the election algorithm Speed, specifically designed for synchronous systems (see Figure 6.13): the transformation lemma yields bounds that are orders of magnitude better than those previously obtained by the specifically designed algorithm.

Once we have obtained a solution protocol using a transformer, both the bits and the time complexity of this solution depend on the communicator employed by the transformer. Sometimes, the time complexity can be further reduced without increasing the number of bits by using pipeline. For example, during every stage of protocol Stages, and thus of protocol SynchStages, the information from each candidate must reach the neighboring candidate on each side. This operation, as we have already seen, can be efficiently done in pipeline, yielding a reduction in time costs (Exercise 6.6.26).

Design Implications

The transformation lemma gives a basis of comparison for designing efficient synchronous solutions to problems for which there already exist asynchronous solutions. To improve on the bounds obtained by the use of the transformation lemma, it is necessary to exploit more explicitly and cleverly the availability of “time” as a computational tool. Some techniques that achieve this goal for some specific problems are described in the next sections.

Protocol       Bits          Time
Speed          O(n log i)    O(2^i n)
SynchStages    O(n log n)    O(i n log n)

FIGURE 6.13: The transformer yields a more efficient ring election protocol.


When designing a protocol, our aim must be to avoid the transmission of unbounded messages; in particular, if the input values are drawn from some unbounded universe (e.g., positive integers) and the goal of the computation is the evaluation of a function of the input values, then the messages cannot contain such values. For example, the “trick” on which the transformation lemma is based is an instance of a simple and direct way of exploiting time by counting it; in this case, the actual value is communicated but not transmitted.

6.3 MIN-FINDING AND ELECTION: WAITING AND GUESSING

Our main goal as protocol designers is to exploit the fact that in synchronous systems, time is an explicit computational tool, so as to develop efficient solutions for the assigned task or problem. Let us consider again two problems that we have extensively studied for asynchronous networks: minimum-finding and election. We assume the standard restrictions for minimum-finding (R), as well as Synch; in the case of election, we obviously assume Initial Distinct Values (ID) also.

We have already seen a solution protocol, Speed, designed for synchronous ring networks; we have observed how its low message costs came at the expense of a time complexity that is exponential in the range of the input values.

The Transformation Lemma provides a tool that automatically produces a synchronous solution when an asynchronous one is already available. We have seen how the use of a transformer leads to an election protocol for rings, SynchStages, with reduced bits and time costs. By integrating pipeline, we can obtain further improvements.

The cost of minimum-finding and election can be significantly reduced by using other types of “temporal” tools and techniques. In this section, we will describe two basic techniques that make an explicit use of time: waiting and guessing. We will describe and use them to efficiently solve MinFinding and Election in rings and other networks.
6.3.1 Waiting

Waiting is a technique that uses time not to transmit a value (as in the communicators), but to ensure that a desired condition is verified.

Waiting in Rings

Consider a ring network where each entity x has as initial value a positive integer id(x). Let us assume, for the moment, that the ring is unidirectional and that all entities start at the same time (i.e., simultaneous initiation). Let us further assume that the ring size n is known.

The way of finding the minimum value using waiting is surprisingly simple. What an entity x will initially do is nothing, but just wait. More precisely,

Waiting
1. The entity x waits for a certain amount of time f(id(x), n).
2. If nothing happens during this time, the entity determines “I am the smallest” and sends a “Stop” message.
3. If, instead, while waiting the entity receives a “Stop” message, it determines “I am not the smallest” and forwards the message.

With the appropriate choice of the waiting function f, this surprisingly simple protocol works correctly!

To make the process work correctly, the entities with the smallest value must finish waiting before anybody else does (in this way, each of them will correctly determine “I am the minimum”). In other words, the waiting function f must be monotonically increasing in the value: if id(x) < id(y), then f(id(x), n) < f(id(y), n).

This is, however, not sufficient. In fact, it is also necessary that every entity whose value is not the smallest receives a “Stop” message while still waiting (in this way, each of them will correctly determine “I am not the minimum”). To achieve this, it is necessary that if x originates a “Stop” message, this message reaches every entity y with id(x) < id(y) while y is still waiting; that is, if id(x) < id(y), then

$$f(id(x), n) + d(x, y) < f(id(y), n), \tag{6.21}$$

where d(x, y) denotes the distance of y from x in the ring. This must hold regardless of the distance d(x, y) and regardless of how small id(y) is (provided id(y) > id(x)). As d(x, y) ≤ n − 1 for every two entities in the ring, and the smallest value larger than id(x) is clearly id(x) + 1, any function f satisfying the following inequality

$$\begin{aligned} f(0) &= 0 \\ f(v, n) + n - 1 &< f(v + 1, n) \end{aligned} \tag{6.22}$$

will make protocol Wait function correctly. Such is, for example, the waiting function f (i, n) = i n.

(6.23)
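The behaviour of the protocol with the waiting function (6.23) can be checked with a small simulation. The Python sketch below is illustrative only: the name simulate_wait, the tick-by-tick model (a message sent at tick t is received at tick t + 1), and the sample values are assumptions made here, not part of the protocol's specification. It models a unidirectional ring with simultaneous start.

```python
def simulate_wait(ids, f=lambda v, n: v * n):
    """Simulate Waiting on a unidirectional ring with simultaneous start.

    ids[k] is the value of the entity at position k; a message sent by
    position k is received by position (k + 1) % n one clock tick later.
    Returns (states, messages_sent, time_of_last_reception).
    """
    n = len(ids)
    state = ["CANDIDATE"] * n
    alarm = [f(v, n) for v in ids]
    arriving = set()       # positions that receive a "Stop" at the current tick
    messages = 0
    t = 0
    while "CANDIDATE" in state or arriving:
        next_arriving = set()
        # 1. Deliver the "Stop" messages sent at the previous tick.
        for k in arriving:
            if state[k] == "CANDIDATE":          # still waiting: not the minimum
                state[k] = "LARGE"
                next_arriving.add((k + 1) % n)   # forward the message
                messages += 1
            # an entity already in state MINIMUM absorbs the message
        # 2. Entities whose waiting time expires now become MINIMUM.
        for k in range(n):
            if state[k] == "CANDIDATE" and alarm[k] == t:
                state[k] = "MINIMUM"
                next_arriving.add((k + 1) % n)
                messages += 1
        arriving = next_arriving
        t += 1
    return state, messages, t - 1

# Assumed sample ring with n = 6 and smallest value 3 held by two entities.
states, messages, finish = simulate_wait([5, 3, 7, 3, 9, 4])
```

In this sample run the two entities with value 3 stop waiting at time 6 × 3 = 18, all the others become LARGE while still waiting, exactly n = 6 messages are sent, and the last message is received at time 22, which is within 18 + n.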

As an example, consider the ring topology shown in Figure 6.14(a) where n = 6. The entities with the smallest value, 3, will ﬁnish waiting before all others: After 6 × 3 = 18 units of time they send a message along the ring. These messages travel along the ring encountering the other entities while they are still waiting, as shown in Figure 6.14(b). IMPORTANT. Protocol Wait solves the minimum-ﬁnding problem, not the election: Unless we assume initial distinct values, more than one entity might have the same smallest value, and they will all correctly determine that they are the minimum.


FIGURE 6.14: (a) The time when an entity x would finish waiting; (b) the messages sent by the entities with value 3 at time 6 × 3 = 18 reach the other entities while they are still waiting.

As an example of execution of waiting under the (ID) restriction, consider the ring topology shown in Figure 6.15 where n = 6, and the values outside the nodes indicate how long each entity would wait. The unique entity with the smallest value, 3, will be elected after 6 × 3 = 18 units of time. Its “Stop” message travels along the ring encountering the other entities while they are still waiting.

FIGURE 6.15: Execution with Initial Distinct Values: a leader is elected.

Protocol       Bits          Time            Notes
Speed          O(n log i)    O(2^i n)
SynchStages    O(n log n)    O(i n log n)
Wait           O(n)          O(i n)          n known

FIGURE 6.16: Waiting yields a more efficient ring election protocol.

What is the cost of such a protocol? Only an entity that becomes minimum originates a message; this message will only travel along the ring (forwarded by the other entities, which become large) until the next minimum entity. Hence the total number of messages is just n; as these messages are signals that do not contain any value, Wait uses only O(n) bits. This is the least amount of transmission possible.

Let us consider the time. It will take $f(i_{min}, n) = i_{min}\, n$ time units for the entities with the smallest value to decide that they are the minima; at most n − 1 additional time units are needed to notify all others. Hence, the time is O(i n), where i is the range of the input values. Compared with the other protocols we have seen for election in the ring, Speed and SynchStages, the bit complexity is even better (see Figure 6.16).

Without Simultaneous Initiation

We have derived this surprising result assuming that the entities start simultaneously. If the entities can start at any time, it is possible that an entity with a large value starts so much earlier than the others that it will finish waiting before them and incorrectly determine that it is the minimum.

This problem can be taken care of by making sure that, although the entities do not start at the same time, they start not too far apart (in time) from each other. To achieve this, it is sufficient to perform a wake-up: when an entity spontaneously wants to start the protocol, it will first of all send a “Start” message to its neighbor and then start waiting. An inactive entity will become active upon receiving the “Start” message, forward it, and start its waiting process.

Let t(x) denote the time when entity x becomes awake and starts its waiting process; then, for any two entities x and y,

$$\forall x, y \quad t(y) - t(x) \le d(x, y); \tag{6.24}$$

in particular, no two entities will start more than n − 1 clock ticks off from each other. The waiting function f must now take into account this fact. As before, it is necessary that if id(x) < id(y), then x must ﬁnish waiting before y and its message should reach y while still waiting; but now this must happen regardless of at what time t(x) entity x starts and at what time t(y) entity y starts; that is, if id(x) < id(y), t(x) + f (id(x), n) + d(x, y) < t(y) + f (id(y), n).

(6.25)


As d(x, y) < n for every two entities in the ring, by Equation 6.24, and by setting f (0) = 0, it is easy to verify that any function f satisfying the inequality

$$\begin{aligned} f(0) &= 0 \\ f(v, n) + 2n - 1 &< f(v + 1, n) \end{aligned} \tag{6.26}$$

will make protocol Wait function correctly even if the entities do not start simultaneously. Such is, for example, the waiting function f (v, n) = 2 n v.

(6.27)
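Condition (6.25) can be checked by brute force for small parameters. The following Python sketch is an illustration under stated assumptions: the function name is hypothetical, values range over 0..max_value, and the only fact used about the starting times is the wake-up guarantee that two entities at distance d start at most d ticks apart.

```python
def safe_without_simultaneous_start(f, n, max_value):
    """Check condition (6.25) for all values v < w <= max_value.

    For every distance d = d(x, y) in 1..n-1 and every admissible start
    offset t(x) - t(y) in -d..d, require
        t(x) + f(v, n) + d < t(y) + f(w, n).
    """
    for v in range(max_value):
        for w in range(v + 1, max_value + 1):
            for d in range(1, n):
                for offset in range(-d, d + 1):      # offset = t(x) - t(y)
                    if not offset + f(v, n) + d < f(w, n):
                        return False
    return True

# The function (6.27) passes; the function (6.23), designed for
# simultaneous start, does not.
ok_doubled = safe_without_simultaneous_start(lambda v, n: 2 * n * v, 6, 10)
ok_simple = safe_without_simultaneous_start(lambda v, n: n * v, 6, 10)
```

The failing case for f(v, n) = v n is the one described in the text: an entity with a larger value that starts early enough finishes waiting before the message of the true minimum can reach it.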

The cost of the protocol is slightly bigger, but the order of magnitude is the same. In fact, in terms of bits we are also performing a wake-up that, in a unidirectional ring, costs n bits. As for the time, the new waiting function is just twice the old one; hence, the time costs are at most doubled. In other words, the costs are still those indicated in Figure 6.16.

In Bidirectional Rings

We have considered unidirectional rings. If the ring is bidirectional, the protocol requires marginal modifications, as shown in Figure 6.17. The same costs as in the unidirectional case can be achieved with the same waiting functions.

On the Waiting Function

We have assumed that the ring size n is known to the entities; it is indeed used in the requirements for waiting functions (Expressions 6.22 and 6.26).

An interesting feature (Exercise 6.6.31) is that those requirements would work even if a quantity $\bar{n}$ is used instead of n, provided $\bar{n} \ge n$. Hence, it is sufficient that the entities know (the same) upper bound $\bar{n}$ on the network size.

If the entities all have available a value $\bar{n}$ that is, however, smaller than n, its use in a waiting function instead of n would in general lead to incorrect results. There is, however, a range of values for $\bar{n}$ that would still guarantee the desired result (Exercise 6.6.32).

A final interesting observation is the following. Consider the general case when the entities have available neither n nor a common value $\bar{n}$; that is, each entity only knows its initial value id(x). In this case, if each entity uses in the waiting function its value id(x) instead of n, the function would work in some cases, for example, when all initial values id(x) are not smaller than n. See Exercise 6.6.33.

Universal Waiting Protocol

The waiting technique we have designed for rings is actually much more general and can be applied in any connected network G, regardless of its topology. It is thus a universal protocol. The overall structure is as follows:

1. First a reset is performed with message “Start.”
2. As soon as an entity x is active, it starts waiting f(id(x), n) time units.


PROTOCOL Wait

States: S = {ASLEEP, CANDIDATE, LARGE, MINIMUM};
S_INIT = {ASLEEP};
S_TERM = {LARGE, MINIMUM}.
Restrictions: R ∪ Synch ∪ Ring ∪ Known(n).

ASLEEP
    Spontaneously
    begin
        set alarm := c(x) + f(id(x), n);
        send("Start") to right;
        direction := right;
        become CANDIDATE;
    end

    Receiving("Start")
    begin
        set alarm := c(x) + f(id(x), n);
        send("Start") to other;
        direction := other;
        become CANDIDATE;
    end

CANDIDATE
    When(c(x) = alarm)
    begin
        send("Over") to direction;
        become MINIMUM;
    end

    Receiving("Over")
    begin
        send("Over") to other;
        become LARGE;
    end

FIGURE 6.17: Protocol Wait.

3. If nothing happens while x is waiting, x determines that “I am the minimum” and initiates a reset with message “Stop.”
4. If, instead, a “Stop” message arrives while x is waiting, then it stops waiting, determines that “I am not the minimum,” and participates in the reset with message “Stop.”

Again, regardless of the initiation times, it is necessary that the entities with the smallest value finish waiting before the entities with larger values and that all those other entities receive a “Stop” message while still waiting. That is, if id(x) < id(y), then

$$t(x) + f(id(x)) + d_G(x, y) < t(y) + f(id(y)),$$


where dG (x, y) denotes the distance between x and y in G, and t(x) and t(y) are the times when x and y start waiting. Clearly, for all x, y, |t(x) − t(y)| ≤ dG (x, y); hence, setting f (0) = 0, we have that any function satisfying

$$\begin{aligned} f(0) &= 0 \\ f(v) + 2 d_G &< f(v + 1) \end{aligned} \tag{6.28}$$

makes the protocol correct, where dG is the diameter of G. This means that, for example, the function f (v) = 2 v (dG + 1)

(6.29)

would work. As $n - 1 \ge d_G$ for every G, this also means that the function f(v) = 2 v n we had determined for rings actually works in every network; it might not be the most efficient though (Exercises 6.6.29 and 6.6.30).

Applications of Waiting

We will now consider two rather different applications of protocol Wait. The first is to compute two basic Boolean functions, AND and OR; the second is to reduce the time costs of protocol Speed that we discussed earlier in this chapter. In both cases we will consider a unidirectional ring for the discussion; the results, however, trivially generalize to all other networks. In discussing these applications, we will discover some interesting properties of the waiting function.

Computing AND and OR

Consider the situation where every entity x has a Boolean value b(x) ∈ {0, 1}, and we need to compute the AND of all those values. Assume as before that the size n of the ring is known. The AND of all the values will be 1 if and only if ∀x b(x) = 1, that is, all the values are 1; otherwise the result is 0.

Thus, to compute AND it suffices to know if there is at least one entity x with value b(x) = 0. In other words, we just need to know whether the smallest value is 0 or 1.

With protocol Waiting we can determine the smallest value. Once this is done, the entities with such a value know the result. If the result of AND is 1, all the entities have value 1 and are in state minimum, and thus know the result. If the result of AND is 0, the entities with value 0 are in state minimum (and thus know the result), while the others are in state large (and thus know the result).

Notice that if an entity x has value b(x) = 0, using the waiting function of Expression 6.27, its waiting time will be f(b(x)) = 2 b(x) n = 0. That is, if an entity has value 0, it does not wait at all. To determine the cost of the overall protocol is quite simple (Exercise 6.6.35).

In a similar way we can use protocol Waiting to compute the OR of the input values (Exercise 6.6.36).
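The reduction of AND to minimum-finding can be sketched in a few lines. The Python fragment below is a simplified model, not the protocol itself: the function name is hypothetical, a unidirectional ring with simultaneous start is assumed, and instead of exchanging messages it directly computes when the last entity learns the result, using the waiting function f(b) = 2 b n.

```python
def and_by_waiting(bits):
    """Return (AND of the bits, time at which the last entity knows it).

    Entities with value 0 wait f(0) = 0 ticks and send "Stop" at once;
    entities with value 1 wait f(1) = 2n ticks.
    """
    n = len(bits)
    zeros = [k for k, b in enumerate(bits) if b == 0]
    if not zeros:
        # Nobody interrupts the waiting: all become MINIMUM at time 2n.
        return 1, 2 * n
    # Each entity hears from the closest 0-entity behind it on the ring
    # (distance 0 for the 0-entities themselves).
    latest = max(min((k - z) % n for z in zeros) for k in range(n))
    return 0, latest

result_mixed = and_by_waiting([1, 1, 0, 1])
result_ones = and_by_waiting([1, 1, 1])
```

In the first sample call the single 0 is announced to everybody within n − 1 = 3 ticks, well before any entity with value 1 would finish waiting; in the second, every entity waits the full 2n = 6 ticks and then knows that the result is 1.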

Reducing Time Costs of Speed

The first synchronous election protocol we have seen for ring networks is Speed, discussed in Section 6.1.4. (NOTE: to solve the election problem it assumes initial distinct values.) Based on the idea of messages traveling along the ring at different speeds, this protocol unfortunately has a terrifying time complexity: exponential in the (a priori unbounded) smallest input value $i_{min}$ (see Figure 6.16). Protocol Waiting has a much better complexity, but it requires knowledge of (an upper bound on) n; on the contrary, protocol Speed requires no such knowledge.

It is possible to reduce the time costs of Speed substantially by adding Waiting as a preliminary phase. As each entity x knows only its value id(x), it will first of all execute Waiting using $2\,id(x)^2$ as the waiting function. Depending on the relationship between the values and n, the Waiting protocol might work (Exercise 6.6.33), determining the unique minimum (and hence electing a leader). If it does not work (a situation that can be easily detected; see Exercise 6.6.34), the entities will then use Speed to elect a leader.

The overall cost of this combined protocol Wait + Speed clearly depends on whether the initial Waiting succeeds in electing a leader or not. If Waiting succeeds, we will not execute Speed and the cost will just be $O(i_{min}^2)$ time and O(n) bits.

If Waiting does not succeed, we must also run Speed, which costs O(n) messages but $O(n\,2^{i_{min}})$ time. So the total cost will be O(n) messages and $O(i_{min}^2 + n\,2^{i_{min}}) = O(n\,2^{i_{min}})$ time. However, if Waiting does not succeed, it is guaranteed that the smallest initial value is smaller than n, that is, $i_{min} < n$ (see again Exercise 6.6.33). This means that the overall time cost will be only $O(n\,2^n)$.

In other words, whether the initial Waiting succeeds or not, protocol Wait + Speed will use O(n) messages. As for the time, it will cost either $O(i_{min}^2)$ or $O(n\,2^n)$, depending on whether the waiting succeeds or not.

Summarizing, using Waiting we can reduce the time complexity of Speed from $O(n\,2^i)$ to $O(\max\{i^2, n\,2^n\})$, adding at most O(n) bits.


Application: Randomized Election

If the assumption on the uniqueness of the identities does not hold, the election problem obviously cannot be solved by any minimum-finding process, including Wait. Furthermore, we have already seen that if the nodes have no identities (or, analogously, all have the same identity), then no deterministic solution exists for the election problem, duly renamed symmetry breaking problem, regardless of whether the network is synchronous or not. This impossibility result applies to deterministic protocols, that is, protocols where every action is composed only of deterministic operations.

A different class of protocols are those where an entity can perform operations whose result is random, for example, tossing a die, and where the nature of the action depends on the outcome of this random event. For example, an entity can toss a coin and, depending on whether the result is “head” or “tail,” perform a different operation. These types of protocols will be called randomized; unlike their deterministic counterparts, randomized protocols give no guarantees, either on the correctness of their result or on the termination of their execution. So, for example, some randomized protocols always terminate but the solution is correct only with a given probability; this type of protocol is called Monte Carlo. Other protocols will have the correct solution if they terminate, but they terminate only with a given probability; this type of protocol is called Las Vegas.

We will see how protocol Wait can be used to generate a surprisingly simple and extremely efficient Las Vegas protocol for symmetry breaking. Again we assume that n is known. We will restrict the description to unidirectional rings; the results can, however, be generalized to several other topologies (Exercises 6.6.37-6.6.39).

1. The algorithm is composed of a sequence of rounds.
2. In each round, every entity randomly selects an integer between 0 and b as its identity, where b ≤ n.
3. If the minimum of the chosen values is unique, that entity will become leader; otherwise, a new round is started.

To make the algorithm work, we need to design a mechanism to find the minimum and detect if it is unique. But this is exactly what protocol Wait does. In fact, protocol Wait not only finds the minimum value but also allows an entity x with such a value to detect if it is the only one. In fact,

– If x is the only minimum, its message will come back exactly after n time units; in this case, x will become leader and send a “Terminate” message to notify all other entities.
– If there is more than one minimum, x will receive a message before n time units; it will then send a “Restart” message and start the next round.

In other words, each round is an execution of protocol Wait; thus, it costs O(n) bits, including the “Restart” (or “Terminate”) messages. The time used by protocol Wait is O(n i). In our case the values are integers between 0 and b, that is, i ≤ b. Thus, each round will cost at most O(n b) time.


We have different options with regard to the value b and how the random choice of the identities is made. For example, we can set b = n and choose each value with the same probability (Exercise 6.6.40); notice, however, that the larger b is, the larger the time costs of each round will be. We will use instead b = 1 (i.e., each entity randomly chooses either 0 or 1) and employ a biased coin. Specifically, in our protocol, which we will call Symmetry, we will employ the following criteria:

Random Selection Criteria: In each round, every entity selects 0 with probability $\frac{1}{n}$, and 1 with probability $\frac{n-1}{n}$.

Up to now, except for the random selection criteria, there has been little difference between Symmetry and the deterministic protocols we have seen so far. This is going to change soon.

Let us compute the number of rounds required by the protocol until termination. The surprising thing is that this protocol might never terminate, and thus the number of rounds is potentially infinite. In fact, with a protocol of type Las Vegas, we know that if it terminates, it solves the problem, but it might not terminate. This is not good news for those looking for protocols with a guaranteed performance. The advantage of this protocol is instead in the low expected number of rounds before termination.

Let us compute this quantity. Using the random selection criteria described above, the protocol terminates as soon as exactly one entity chooses 0. For this to happen, one entity x must choose 0 (this happens with probability $\frac{1}{n}$), while the other n − 1 entities must choose 1 (this happens with probability $\left(\frac{n-1}{n}\right)^{n-1}$). As there are $\binom{n}{1} = n$ choices for x, the probability that exactly one entity chooses 0 is

$$\binom{n}{1}\,\frac{1}{n}\left(\frac{n-1}{n}\right)^{n-1} = \left(\frac{n-1}{n}\right)^{n-1}.$$

For n large enough, this quantity is easily bounded; in fact

$$\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^{n-1} = \frac{1}{e}, \tag{6.30}$$

where e ≈ 2.7... is the base of the natural logarithm. This means that, for n large enough, each round succeeds with probability about 1/e, so the expected number of rounds before termination is about e. In other words, protocol Symmetry is expected to elect a leader with O(n) bits in O(n) time. Obviously, there is no guarantee that a leader will be elected with this cost, or will be elected at all; but with high probability it will, and at that cost. This shows the unique nature of randomized protocols.
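The probability computed above, and the resulting number of rounds, can be explored numerically. The Python sketch below is only an illustration: the function names, the seed, and the sample sizes are arbitrary choices made here, and a round is modeled simply as n independent biased coin tosses (the execution of Wait within the round is not simulated).

```python
import math
import random

def success_probability(n: int) -> float:
    """Probability that exactly one of n entities selects 0 in a round."""
    return ((n - 1) / n) ** (n - 1)

def symmetry_rounds(n: int, rng: random.Random) -> int:
    """Repeat rounds until exactly one entity selects 0; return the round count."""
    rounds = 0
    while True:
        rounds += 1
        # Each entity selects 0 with probability 1/n, otherwise 1.
        zeros = sum(1 for _ in range(n) if rng.random() < 1 / n)
        if zeros == 1:
            return rounds

rng = random.Random(2007)                 # fixed seed, arbitrary value
trials = 2000
average_rounds = sum(symmetry_rounds(50, rng) for _ in range(trials)) / trials
gap_to_limit = abs(success_probability(1000) - 1 / math.e)
```

The number of rounds follows a geometric distribution whose success probability approaches 1/e, so the observed average should be close to e; any single run can, of course, take much longer.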


6.3.2 Guessing

Guessing is a technique that allows some entities to determine a value not by transmitting it but by guessing it. Again we will consider the minimum-finding and election problems in ring networks. Let us assume, for the moment, that the ring is unidirectional and that all entities start at the same time (i.e., simultaneous initiation). Let us further assume that the ring size n is known.

Minimum-Finding as a Guessing Game

At the base of the guessing technique there is a basic utility protocol Decide(p), where p is a parameter available to all entities. Informally, protocol Decide(p) is as follows:

Decide(p): Every entity x compares its value id(x) with the protocol parameter p. If id(x) ≤ p, x sends a message; otherwise, it will forward any received message.

There are only two possible situations and outcomes:

S1: All local values are greater than p; in this case, no messages will be transmitted: there will be “silence” in the system.
S2: At least one entity x has id(x) ≤ p; in this case, every entity will send and receive a message: there will be “noise” in the system.

The goal of protocol Decide is to make all entities know in which of the two situations we are. Let us examine how an entity y can determine whether we are in situation S1 or S2. If id(y) ≤ p, then y knows immediately that we are in situation S2. However, if id(y) > p, then y does not know whether all the entities have values greater than p (situation S1) or some entities have a value smaller than or equal to p (situation S2). It does know that if we are in situation S2, it will eventually receive a message; by contrast, if we are in situation S1, no message will ever arrive.

Clearly, to decide, y must wait; also clearly, it cannot wait forever. How long should y wait? The answer is simple: if a message was sent by an entity, say x, a message will arrive at y within at most d(x, y) < n time units from the time it was sent. Hence, if y does not receive any message in the first n time units since the start, then none is coming and we are in situation S1. For this reason, n time units after the entities (simultaneously) start the execution of protocol Decide(p), all the entities can decide which situation (S1 or S2) has occurred. The full protocol is shown in Figure 6.18.

IMPORTANT. Consider the execution of Decide(p).

– If situation S1 occurs, it means that all the values, including $i_{min} = \min\{id(x)\}$, are greater than p, that is, $p < i_{min}$. We will say that p is an underestimate on $i_{min}$.
– If situation S2 occurs, it means that there are some values that are not greater than p; thus, $p \ge i_{min}$. We will say that p is an overestimate on $i_{min}$.


SUBPROTOCOL Decide(p)

Input: positive integer p;
States: S = {START, DECIDED, UNDECIDED};
S_INIT = {START};
S_TERM = {DECIDED}.
Restrictions: R ∪ Synch ∪ Ring ∪ Known(n) ∪ Simultaneous Start.

START
    Spontaneously
    begin
        set alarm := c(x) + n;
        if id(x) ≤ p then
            decision := high;
            send("High") to right;
            become DECIDED;
        else
            become UNDECIDED;
        endif
    end

UNDECIDED
    Receiving("High")
    begin
        decision := high;
        send("High") to other;
        become DECIDED;
    end

    When(c(x) = alarm)
    begin
        decision := low;
        become DECIDED;
    end

FIGURE 6.18: SubProtocol Decide(p).
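A tick-by-tick simulation makes the two outcomes of Decide(p) and their costs concrete. The Python sketch below is an illustration under the restrictions of the subprotocol (unidirectional ring, simultaneous start, n known); the function name, the data layout, and the sample values are assumptions made here.

```python
def decide(ids, p):
    """Simulate Decide(p); return (decisions, bits_sent, time_units).

    ids[k] is the value of the entity at position k; a message sent by
    position k is received by position (k + 1) % n one tick later.
    """
    n = len(ids)
    decision = [None] * n            # None models the state UNDECIDED
    bits = 0
    arriving = set()                 # positions receiving "High" at this tick
    for t in range(n + 1):           # ticks 0..n; the alarm rings at tick n
        next_arriving = set()
        for k in arriving:
            if decision[k] is None:  # UNDECIDED: learn "high" and forward
                decision[k] = "high"
                next_arriving.add((k + 1) % n)
                bits += 1
        if t == 0:                   # simultaneous start
            for k in range(n):
                if ids[k] <= p:
                    decision[k] = "high"
                    next_arriving.add((k + 1) % n)
                    bits += 1
        arriving = next_arriving
    # Alarm: whoever is still undecided has heard only silence.
    decision = ["low" if d is None else d for d in decision]
    return decision, bits, n

silence = decide([5, 3, 7, 3, 9, 4], 2)   # p is an underestimate (S1)
noise = decide([5, 3, 7, 3, 9, 4], 4)     # p is an overestimate (S2)
```

With the sample values, the guess p = 2 produces silence (every entity decides low, no bit is sent), while p = 4 produces noise (every entity decides high and exactly n = 6 single-bit messages are sent); both executions take n time units.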

These observations are summarized in Figure 6.19.

NOTE. The condition $p = i_{min}$ is also considered an overestimate.

Using this fact, we can reformulate the minimum-finding problem in terms of a guessing game: each entity is a player. The minimum value $i_{min}$ is a number, previously chosen and unknown to the players, that must be guessed. The players can ask questions of the type “Is the number greater than p?”

Situation   Condition      Name              Time   Bits
S1          p < i_min      “underestimate”   n      0
S2          p ≥ i_min      “overestimate”    n      n

FIGURE 6.19: Results and costs of executing protocol Decide.


Each question corresponds to a simultaneous execution of Decide(p). Situations S1 and S2 correspond to a "YES" and a "NO" answer to the question, respectively. A guessing protocol will just specify which questions should be asked to discover imin . Initially, all entities choose the same initial guess p1 and simultaneously perform Decide(p1 ). After n time units, all entities will be aware of whether or not imin is greater

than p1 (situation S1 and situation S2, respectively). On the basis of the outcome, a new guess p2 will be chosen by all entities that will then simultaneously perform Decide(p2 ). In general, on the basis of the outcome of the execution of Decide(pi ), all entities will choose a new guess pi+1 . The process is repeated until the minimum value imin is unambiguously determined. Depending on which strategy is employed for choosing pi+1 given the outcome of Decide(pi ), different minimum-ﬁnding algorithms will result from this technique. Before examining how to best play (and win) the game, let us discuss the costs of asking a question, that is, of executing protocol Decide. Observe that the number of bits transmitted when executing Decide depends on the situation, S1 or S2, we are in. In fact in situation S1, no messages will be transmitted at all. By contrast, in situation S2, there will be exactly n messages; as the content of these messages is not important, they can just be single bits. Summarizing, If our guess is an overestimate, we will pay n bits; if it is an underestimate, it will cost nothing. As for the time costs, each execution of Decide will cost n time units regardless of whether it is an underestimate or overestimate. This means that we pay n time units for each question; however, we pay n bits only if our guess is an overestimate. See Figure 6.19. Our goal must, thus, be to discover the number, asking few questions (to minimize time) of which as few as possible are overestimates (to minimize transmission costs). As we will see, we will unfortunately have to trade off one cost for the other. We will ﬁrst consider a simpliﬁed version of the game, in which we know an upperbound M on the number to be guessed, that is, we know that imin ∈ [1, M] (see Figure 6.20). We will then see how to easily and efﬁciently establish such a bound. Playing the Game We will now investigate how to design a successful strategy for the guessing game. 
The number $i_{\min}$ to be guessed is known to be in the interval $[1, M]$ (see Figure 6.20). Let us denote by $q$ the number of questions and by $k \le q$ the number of overestimates used to solve the game; this will correspond to a minimum-finding protocol that uses $qn$ time and $kn$ bits. As each overestimate costs us $n$ bits, to design an overall strategy that uses only $O(n)$ bits in total (as we did with protocol Waiting), we must use only a constant (i.e., $O(1)$) number of overestimates; clearly, we want to use as few questions as possible.

FIGURE 6.20: Guessing in an interval.

Let us first solve the problem with $k = 1$, that is, we want to find the minimum with only one overestimate. As the guess that finds the number (i.e., $p = i_{\min}$) is itself already an overestimate, $k = 1$ means that we can never use as a guess a value greater than $i_{\min}$. For this setting, there is only one possible solution strategy, linear search: The guesses will be $p_1 = 1$, $p_2 = 2$, $p_3 = 3, \ldots$ All these guesses will be underestimates until we reach $p_{i_{\min}} = i_{\min}$, which will be our first and only overestimate. See Figure 6.21. The number of questions will be exactly $i_{\min}$; that is, in the worst case, the cost will be $k = 1$; $q = M$.

FIGURE 6.21: Linear search is the only possibility when k = 1.

Let us now allow one more overestimate, that is, $k = 2$. Several strategies are now possible. A solution is to partition the interval into $\sqrt{M}$ consecutive pieces of size $\sqrt{M}$. (If $M$ is not a perfect square, the last interval will be smaller than the others.) See Figure 6.22. We will first search sequentially among the points $a_1 = \sqrt{M} - 1$, $a_2 = 2\sqrt{M} - 2, \ldots$, until we hit an overestimate. At this point we know the interval where $i_{\min}$ is. The second overestimate is then spent to find $i_{\min}$ inside that interval using sequential search (as in the case $k = 1$). In the worst case, we have to search all the $a_j$ and all of the last interval, that is, in the worst case the cost will be $k = 2$; $q = 2\sqrt{M}$.

Notice that by allowing a single additional overestimate (i.e., using an additional $n$ bits) we have been able to reduce the time costs from linear to sublinear. In other words, the trade-off between bits and time is not linear. It is easy to generalize this approach (Exercise 6.6.43) so as to find $i_{\min}$ with a worst-case cost of $k$; $q = k\,M^{1/k}$.

FIGURE 6.22: Dividing the interval when k = 2.
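The two strategies just described can be simulated to count questions and overestimates. The following is only a minimal sketch under stated assumptions: a guess $p$ is treated as an overestimate exactly when $p \ge i_{\min}$, the pieces for $k = 2$ end at multiples of $\lceil \sqrt{M} \rceil$ (a simplification chosen for this illustration), and the function names are hypothetical.

```python
import math

def linear_search(i_min):
    """k = 1: guess 1, 2, 3, ... until the first (and only) overestimate."""
    questions = 0
    p = 0
    while True:
        p += 1
        questions += 1
        if p >= i_min:                      # overestimate: the search stops here
            return p, questions, 1

def two_level_search(i_min, M):
    """k = 2: probe the end of each piece, then scan linearly inside one piece."""
    step = math.isqrt(M - 1) + 1            # ceil(sqrt(M)), assumed piece size
    questions = overestimates = 0
    low = 0
    while True:                             # first phase: find the right piece
        high = min(low + step, M)
        questions += 1
        if high >= i_min:                   # first overestimate
            overestimates += 1
            break
        low = high
    p = low
    while True:                             # second phase: linear search in (low, high]
        p += 1
        questions += 1
        if p >= i_min:                      # second overestimate
            overestimates += 1
            return p, questions, overestimates

def cost(questions, overestimates, n):
    """Each question costs n time units; each overestimate costs n bits."""
    return questions * n, overestimates * n
```

For instance, with $M = 100$ the worst case of `two_level_search` uses 20 questions and 2 overestimates, in line with the $q = 2\sqrt{M}$ bound above.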


IMPORTANT. Notice that the cost is a trade-off between questions and overestimates: The more overestimates we allow, the fewer questions we need to ask. Furthermore, the trade-off is nonlinear: The reduction in number of questions achieved by adding a single overestimate is rather dramatic.

As every overestimate costs $n$ bits, the total number of bits is $O(nk)$. The total amount of time consumed with this approach is at most $O(n\,k\,M^{1/k})$.

The Optimal Solution

We have just seen a solution strategy for our guessing game when the value to be guessed is in a known interval. How good is this strategy? In the case $k = 1$, there is only one possible solution strategy. However, for $k > 1$ several strategies and solutions are possible. Thus, as usual, to answer the above question we will establish a lower bound. Surprisingly, in this process, we will also find the (one and only) optimal solution strategy.

To establish a lower bound (and find out if a solution is good) we need to answer the following question:

Q1: What is the smallest number of questions $q$ needed to always win the game in an interval of size $M$ using no more than $k$ overestimates?

Instead of answering this question directly, we will “flip its arguments” and formulate another question:

Q2: With $q$ questions of which at most $k$ are overestimates, what is the largest $M$ so that we can always win the game in an interval of that size?

We will answer this one. The answer will obviously depend on both $q$ and $k$, that is, $M$ will be some function $h(q, k)$. Let us determine this function. Some things we already know. For example, if we allow only one overestimate (i.e., $k = 1$), the only solution strategy is linear search, that is,

$$h(q, 1) = q. \tag{6.31}$$

By contrast, if we allow every question to be an overestimate (i.e., $k = q$), then we can always win in a much larger interval; in fact (Exercise 6.6.44),

$$h(q, q) = 2^q - 1. \tag{6.32}$$

Before we proceed, let us summarize the problem we are facing:

1. We have at our disposal $q$ questions of which only $k$ can be overestimates.
2. We must always win.
3. We want to know the size $h(q, k)$ of the largest interval in which this is possible.


FIGURE 6.23: If the initial guess p is an underestimate, the largest interval has size p + h(q − 1, k).

Whatever the strategy is, it must start with a question. Let $p$ be this first guess. There are two possibilities: this is either an underestimate or an overestimate.

If $p$ is an underestimate (i.e., $i_{\min} > p$), we are left with $q - 1$ questions, but we still have $k$ overestimates at our disposal. Now, the largest interval in which we can always win with $q - 1$ questions of which $k$ can be overestimates is $h(q - 1, k)$. This means that if $p$ is the first question (Figure 6.23), the largest interval has size

$$h(q, k) = p + h(q - 1, k).$$

On the basis of this, it would seem that to make the interval as large as possible, we should choose our first guess $p$ to be as large as possible. However, we must take into account the possibility that our first guess turns out to be an overestimate.

If $p$ is an overestimate, we have spent both one question and one overestimate; furthermore, we know that the number is in the interval $[1, p]$. This means that the initial guess $p$ we make must guarantee that we always win in the interval $[1, p]$ with $q - 1$ questions and $k - 1$ overestimates. Thus, the largest $p$ can be is $p = h(q - 1, k - 1)$. This means that

$$h(q, k) = h(q - 1, k) + h(q - 1, k - 1), \tag{6.33}$$

where the boundary conditions are those of expressions 6.31 and 6.32; see Figure 6.24. Solving this recurrence relation (Exercise 6.6.45), we obtain the unique solution

$$h(q, k) = \sum_{j=0}^{k-1} \binom{q}{j}. \tag{6.34}$$

FIGURE 6.24: The initial guess $p$ could be an overestimate; this cannot be larger than $h(q - 1, k - 1)$.
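The recurrence can also be evaluated mechanically. The sketch below is an illustration only: it computes $h(q, k)$ directly from Equation 6.33 with the boundary conditions of Equations 6.31 and 6.32 (it does not rely on a closed form), and the helper `min_questions`, a hypothetical name, searches for the smallest $q \ge k$ with $M \le h(q, k)$, which is the quantity asked for in question Q1.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def h(q, k):
    """Largest interval size winnable with q questions, at most k overestimates."""
    if not 1 <= k <= q:
        raise ValueError("this sketch expects 1 <= k <= q")
    if k == 1:
        return q                              # boundary condition (6.31)
    if k == q:
        return 2 ** q - 1                     # boundary condition (6.32)
    return h(q - 1, k) + h(q - 1, k - 1)      # recurrence (6.33)

def min_questions(M, k):
    """Smallest q >= k such that M <= h(q, k), found by simple iteration."""
    q = k
    while h(q, k) < M:
        q += 1
    return q
```

For example, `min_questions(100, 1)` gives 100 (linear search), while allowing a second overestimate reduces the number of questions considerably.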


We have found the answer to question Q2. If we now “flip the answer,” we can also answer question Q1 and determine a lower bound on $q$ given $M$ and $k$. In fact, if $M = h(q, k)$, then the minimum number of questions to always win in $[1, M]$ with at most $k$ overestimates (our original problem) is precisely $q$. In general, the answer is the smallest $q$ such that $M \le h(q, k)$.

IMPORTANT. In the process of finding a lower bound, we have actually found the (one and only) optimal solution strategy to guess in the interval $[1, M]$ with at most $k$ overestimates. Let us examine this strategy.

Optimal Search Strategy

To optimally search in $[1, M]$ with at most $k$ overestimates:

1. use as a guess $p = h(q - 1, k - 1)$, where $q \ge k$ is the smallest integer such that $M \le h(q, k)$;
2. if $p$ is an underestimate, then optimally search in $[p + 1, M]$ with $k$ overestimates;
3. if it is an overestimate, then optimally search in $[1, p]$ with $k - 1$ overestimates.

This strategy is guaranteed to use the fewest questions.

Unbounded Interval

We have found the optimal solution strategy using at most $k$ overestimates but assuming that the interval in which $i_{\min}$ lies is known. If this is not the case, we can always first establish an upperbound on $i_{\min}$, thus determining an interval, and then search in that interval.

To bound the value $i_{\min}$, again we use guesses, $g(1), g(2), g(3), \ldots$, where $g : \mathbb{N} \to \mathbb{Z}$ is a monotonically increasing function. The first time we hit an overestimate, say with $g(t)$, we know that $g(t - 1) < i_{\min} \le g(t)$ and hence the interval to search is $[g(t - 1) + 1, g(t)]$. See Figure 6.25. This process requires exactly $t$ questions and one overestimate.

We are now left to guess $i_{\min}$ in an interval of size $M = \Delta(t) = g(t) - g(t - 1) + 1$ with $k - 1$ overestimates. (Recall, we just spent one to determine the interval.) Using the optimal solution strategy, this can be done with $h(\Delta(t), k - 1)$ questions. The entire process will thus require at most $t + h(\Delta(t), k - 1)$ questions of which at most $k$ are overestimates.

FIGURE 6.25: In an unbounded interval, we ﬁrst establish an upper bound on imin .
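The bounding phase can be sketched as follows. This is only an illustration: the function name is hypothetical, a guess is again taken to be an overestimate when $g(t) \ge i_{\min}$, and the lower end of the interval is set to 1 when the very first guess is already an overestimate (a case in which $g(t - 1)$ is not defined).

```python
def bound_i_min(i_min, g):
    """Ask g(1), g(2), ... until the first overestimate g(t).

    Returns (t, lower, upper) with lower <= i_min <= upper, where t is the
    number of questions asked; exactly one of them is an overestimate.
    """
    t = 1
    while g(t) < i_min:          # g(t) is an underestimate, keep guessing
        t += 1
    lower = g(t - 1) + 1 if t > 1 else 1
    return t, lower, g(t)
```

With the illustrative choice `g = lambda j: 2 ** j` and $i_{\min} = 5$, the guesses are 2, 4, 8, so the call returns `(3, 5, 8)`.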

| Protocol | Bits | Time | Notes |
|---|---|---|---|
| Speed | $O(n \log i)$ | $O(2^{i}\, n)$ | |
| SynchStages | $O(n \log n)$ | $O(i\, n \log n)$ | |
| Wait | $O(n)$ | $O(i\, n)$ | $n$ known |
| Guess | $O(kn)$ | $O(i^{1/k}\, k\, n)$ | $n$ known |

FIGURE 6.26: Using k = O(1), Guessing is more efﬁcient than other election protocols.

Depending on which function $g$ we use, we obtain different costs. For example, choosing $g(j) = 2^j$ (i.e., doubling our guess at every step), $t = \lceil \log i_{\min} \rceil$ and $\Delta(t) < i_{\min}$. This means that the number of questions used by the entire process is at most $\lceil \log i_{\min} \rceil + h(i_{\min}, k - 1)$. Better performances are possible using different functions $g$; for example (Exercise 6.6.46), with $k$ overestimates, it is possible to reduce the total number of questions to $2\,h(i_{\min}, k) - 1$.

Recall that each question costs $n$ time units and, if it is an overestimate, it also costs $n$ bits. Thus, the complexity of the resulting minimum-finding protocol Guess becomes $O(kn)$ bits and $O(k\,n\,i_{\min}^{1/k})$ time. This means that for any fixed $k$, the guessing approach yields an election protocol that is far more efficient than the ones we have considered so far, as shown in Figure 6.26.

Removing the Assumptions

Knowledge of n

We have assumed that $n$ is known. This knowledge is used only in procedure Decide, employed as a timeout for those entities that do not know if a message will arrive. Clearly the procedure will work even if a quantity $\bar{n} \ge n$ is used instead of $n$, provided that all entities use the same value. Hence, it is sufficient that the entities know (the same) upperbound $\bar{n}$ on the network size.

Network Topology

We have described our protocol assuming that the network is a ring. However, the optimal search strategy for the guessing game is independent of the network topology. To be implemented, it requires subprotocol Decide($p$), which has been described only for rings. This protocol can be made universal, and can thus work in every network, by simple modifications. In fact (Exercise 6.6.47), it suffices:

1. to transform it into a reset with message “High” started by those entities with $\mathrm{id}(x) \le p$; and
2. to use as the timeout an upperbound $\bar{d}$ on the diameter $d$ of the network.


Notice that each question will now cost $\bar{d}$ time units. The number $w$ of bits transmitted if the guess is an overestimate depends on the situation; it is, however, always bounded as follows: $m \le w \le 2m$.

Simultaneous Start

We have assumed that all entities start the first execution of Decide simultaneously. This assumption can actually be removed by simply using a wake-up procedure at the very beginning (so as to bound the delays between initiation times) and using a longer delay between successive guesses (Exercise 6.6.48).

6.3.3 Double Wait: Integrating Waiting and Guessing

We have seen two basic techniques, Waiting and Guessing. Their use has led to bit-optimal and time-efficient solutions for the minimum-finding and election problems; we have described them for ring networks, but we have seen that they are indeed universal. Their only drawback is that they require knowledge of $n$ (or of some upperbound on the diameter $d$). In contrast, both Speed and SynchStages did not require such a priori knowledge.

If this knowledge is not available, it can, however, be acquired during the computation. We are going to see now how this can be done using both waiting and guessing. We will focus solely on the election problem; thus, we will be operating under the restriction of initial distinct values. Once again, we will restrict the description to unidirectional ring networks. We also assume that all entities start within $n - 1$ time units from each other (e.g., they first execute a wake-up).

What we are going to do is to still use the waiting technique to find the smallest value; as we do not know $n$ (nor an upperbound on it), we are going to use the guessing strategy to discover an upperbound on $n$. Let us discuss it in some detail.

Overall Strategy

Each entity is going to execute protocol Wait using a guess $g(1)$ on $n$.
We know that if $g(1) \ge n$, then protocol Wait works (Exercise 6.6.31), that is, the entity with smallest value finishes waiting before all other entities, it becomes small, it sends a message, and its message reaches all other entities while they are still waiting. The problem occurs if $g(1) < n$; in fact, in this case, it is possible that two or more entities with different ids will stop waiting, become small, and send a message.

If we are able to detect that $g(1) < n$, we can then restart with a different, larger guess $g(2) > g(1)$. In general, if $g(j - 1)$ fails (i.e., $g(j - 1) < n$), we can restart with a larger guess $g(j) > g(j - 1)$; this process will terminate as soon as $g(j) \ge n$.

Consider now an entity $x$ that in step $j$ finishes waiting, becomes small, and sends a message. If $g(j) \ge n$, no other entity sends any message, so, after $n$ time units, $x$ receives its own message. By contrast, if $g(j) < n$, several entities might become small and originate messages, each traveling along the ring until it reaches a small entity; hence $x$ would receive the message transmitted by some other entity. Summarizing, in the first case, $x$ receives its own message; in the second case, the message was originated by somebody else.

Without knowing $n$, how can $x$ know whether the received message is its own? Clearly, if each message contains the id of its originator, the problem is trivially solved. However, the number of bits transmitted by just having such a message traveling along the ring will be $O(n \log i)$, an unbounded quantity (see Figure 6.26).

The answer is provided by understanding how transmission delays work in a synchronous ring. Consider the delay $n_x(j)$ from the time $x$ transmits its message to the time a message arrives at $x$. If $x$ receives its own message, then $n_x(j) = n$. By contrast, if $x$ receives the message of somebody else, this will happen before $n$ time units have elapsed, that is, $n_x(j) < n$.

So what $x$ needs to do is to verify whether or not $n_x(j) = n$. This can be done by employing the waiting technique again, using $n_x(j)$ for $n$ in the waiting function. If indeed $n_x(j) = n$, $x$ will again finish waiting without receiving any message and send a new message, and this message will travel along the ring and return after exactly $n_x(j) = n$ time units. If instead $n_x(j) < n$, as we will see, $x$ will notice that something is wrong (i.e., it will receive a message while waiting, it will receive a message before $n_x(j)$ time units, or it will receive no message $n_x(j)$ time units after it sent one, etc.); in this case, it will start the $(j + 1)$th iteration.

Informally, the strategy, called DoubleWait, is as follows:

Strategy DoubleWait:

1. Each entity will execute a first Wait using the current guess $g(j)$ on the unknown $n$. Consider an entity $x$ that finishes waiting without receiving any message. It will send a message “Wait1,” become testing, and wait for a message to arrive, keeping track of the time. Let $n_x(j)$ be the delay from when $x$ sent its “Wait1” message to when $x$ received one.
If the guess was correct (i.e., $g(j) \ge n > g(j - 1)$), then this message would be the one it sent and $n_x(j) = n$.
2. If $x$ notices something wrong (e.g., $n_x(j) \le g(j - 1)$, or $n_x(j) > g(j)$, etc.), it will send a “Restart” message to make everybody restart with a new guess $g(j + 1)$.
3. If $x$ does not notice anything wrong, $x$ will assume that indeed $n_x(j) = n$ and will start a second Wait (with a different waiting function) to verify the guess. If the guess is correct, $x$ is the only entity doing so; it should thus finish waiting without receiving any message. Furthermore, the message “Wait2” it sends now should arrive exactly after $n_x(j)$ time units.
4. If $x$ now notices something wrong (i.e., a message arrives while waiting; a message does not arrive exactly after $n_x(j)$ time units), it will send a “Restart” message to make everybody start with a new guess $g(j + 1)$.
5. Otherwise, $x$ considers the guess verified, becomes the leader, and sends a “Terminate” message.
6. An entity receiving a “Wait1” message while doing the first waiting will forward received messages and wait for either a “Restart” or a “Terminate.” In the first case it restarts with a new guess; in the second case, it becomes defeated.

What we have to show now is that with the appropriate choice of waiting functions, it is impossible for an entity $x$ to be fooled. That is, if $x$ does not notice anything wrong in the first and in the second waiting and becomes leader, then indeed the message $x$ receives is its own and nobody else will become leader.

Choosing the Waiting Functions

What we have to do now is to choose the two waiting functions $f$ and $h$ so that it is impossible for an entity $x$ to be fooled. In other words, it must be impossible that the “Wait1” and “Wait2” messages $x$ receives have actually been sent by somebody else, say $y$ and $z$, and that by pure coincidence both these messages arrived $n_x(j)$ time units after $x$ sent its corresponding messages.

IMPORTANT. These functions must satisfy the properties of waiting functions, that is, if $g(j) \ge n$, then for all $u$ and $v$ with $\mathrm{id}(u) < \mathrm{id}(v)$,

$$f(\mathrm{id}(u), j) + 2(n - 1) < f(\mathrm{id}(v), j)$$
$$h(\mathrm{id}(u), j) + 2(n - 1) < h(\mathrm{id}(v), j).$$

NOTE. We can assume that the entities start the current stage using guess $g(j)$ within $n - 1$ time units from each other; this is enforced in the first stage by the initial wake-up, and in the successive stages by the “Restart” messages.

To determine the waiting functions $f$ and $h$ we need, let us consider the situation in more detail, and let us concentrate on $x$ and see under what conditions it would be fooled.

Denote by $t(x, j)$ the delay between the time the first entity starts the $j$th iteration and the time $x$ starts it. Entity $x$ starts at time $t(x, j)$, waits $f(\mathrm{id}(x), j)$ time, and then sends its “Wait1” message; it receives one at time

$$t(x, j) + f(\mathrm{id}(x), j) + n_x(j).$$
Notice that to “fool” $x$, this “Wait1” message must have been sent by some other entity, $y$. This means that $y$ must also have waited without receiving any message; thus it sent its message at time $t(y, j) + f(\mathrm{id}(y), j)$. This message arrives at $x$ at time

$$t(y, j) + f(\mathrm{id}(y), j) + d(y, x),$$

where, as usual, $d(y, x)$ is the distance from $y$ to $x$. Hence, for $x$ to be “fooled,” it must be

$$t(x, j) + f(\mathrm{id}(x), j) + n_x(j) = t(y, j) + f(\mathrm{id}(y), j) + d(y, x). \tag{6.35}$$

Concentrate again on entity $x$. After it receives the “Wait1” message, $x$ waits again for an additional $h(\mathrm{id}(x), j)$ time units, and then it sends its “Wait2” message; it receives one after $n_x(j)$ time units, that is, at time

$$t(x, j) + f(\mathrm{id}(x), j) + n_x(j) + h(\mathrm{id}(x), j) + n_x(j) = t(x, j) + f(\mathrm{id}(x), j) + h(\mathrm{id}(x), j) + 2\,n_x(j).$$

At this point it becomes leader and sends a “Terminate” message. If $x$ has been fooled the first time, then the “Wait2” message was also sent by some other entity $z$. It is not difficult to verify that if $x$ has been fooled, then there is only one fooling entity, that is, $y = z$ (Exercise 6.6.49).

To have sent a “Wait2” message, $y$ must not have noticed anything wrong (otherwise it would have sent a “Restart” instead). This means that, similarly to $x$, $y$ received a “Wait1” message $n_y(j)$ time units after it sent one, that is, at time $t(y, j) + f(\mathrm{id}(y), j) + n_y(j)$. It waited for another $h(\mathrm{id}(y), j)$ time units and then sent the “Wait2” message; this message thus arrived at $x$ at time

$$t(y, j) + f(\mathrm{id}(y), j) + n_y(j) + h(\mathrm{id}(y), j) + d(y, x).$$

So, if $x$ has been fooled, it must by accident happen that

$$t(x, j) + f(\mathrm{id}(x), j) + h(\mathrm{id}(x), j) + 2\,n_x(j) = t(y, j) + f(\mathrm{id}(y), j) + n_y(j) + h(\mathrm{id}(y), j) + d(y, x). \tag{6.36}$$

Subtracting Equation 6.35 from Equation 6.36, we have

$$h(\mathrm{id}(x), j) + n_x(j) = h(\mathrm{id}(y), j) + n_y(j). \tag{6.37}$$

Summarizing, $x$ will be fooled if and only if the condition of Equation 6.37 occurs. Notice that this condition does not depend on the first waiting function $f$ but only on the second one $h$. What we have to do is to choose a waiting function $h$ that makes the condition of Equation 6.37 impossible. For example, the function

$$h(\mathrm{id}(x), j) = 2\,g(j)\,\mathrm{id}(x) + g(j) - n_x(j)$$

is a correct waiting function and will cause Equation 6.37 to become

$$\mathrm{id}(x) = \mathrm{id}(y). \tag{6.38}$$
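The step from Equation 6.37 to Equation 6.38 can be spelled out by substitution. The following is only a worked restatement of that step, assuming that every guess is positive, that is, $g(j) > 0$:

```latex
\begin{align*}
h(\mathrm{id}(x), j) + n_x(j) &= 2\,g(j)\,\mathrm{id}(x) + g(j), \\
h(\mathrm{id}(y), j) + n_y(j) &= 2\,g(j)\,\mathrm{id}(y) + g(j), \\
\text{hence (6.37) holds} \iff 2\,g(j)\,\mathrm{id}(x) = 2\,g(j)\,\mathrm{id}(y)
  &\iff \mathrm{id}(x) = \mathrm{id}(y).
\end{align*}
```

The term $-n_x(j)$ in the definition of $h$ is what cancels the measured delay, so that the total $h + n$ depends only on the identity and on the current guess.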

As the identities are distinct (because of the ID restriction), this means that $x = y$, that is, the messages $x$ receives are its own. In other words, with this waiting function, nobody will be fooled.

Summarizing, regardless of the waiting function $f$ and of the monotonically increasing guessing function $g$, with the appropriate choice of the second waiting function $h$, protocol DoubleWait correctly elects a leader. (Exercises 6.6.50, 6.6.51, and 6.6.52.)

The Cost of DoubleWait

Now that we have established the correctness of the protocol, let us examine its costs. The protocol consists of a sequence of iterations. In iteration $j$, a guess $g(j)$ is made on the unknown ring size $n$. The terminating condition is simply $g(j) \ge n$; in this case, the entity with the smallest value becomes leader; in all other cases, a new iteration is started.

The number of iterations $\bar{j}$ required by the protocol is easily determined. As the protocol terminates as soon as $g(j) \ge n$,

$$\bar{j} = \lceil g^{-1}(n) \rceil, \tag{6.39}$$

where $g^{-1}$ is the inverse of $g$, that is, $\bar{j}$ is the smallest positive integer $j$ such that $g(j) \ge n$.

In an iteration, the guess $g(j)$ is employed in the execution of a first waiting, using waiting function $f(x, j)$. As a result, either a new iteration is started or a second waiting, using function $h(x, j)$, is executed; as a result of this other waiting, either the algorithm terminates or a new iteration is started, depending on whether or not $g(j) \ge n$.

The overall cost of the protocol depends on the two waiting functions, $f$ and $h$, as well as on the monotonically increasing function $g : \mathbb{N} \to \mathbb{Z}$ specifying the guesses. To determine the cost, we will first examine the number of bits and then determine the time. As we will see, we will have available many choices and, again, we will be facing a trade-off between time and bits.

Bits

Each iteration consists of at most two executions of the waiting technique (with different waiting functions). Each iteration, except the last, will be aborted and a “Restart” message will signal the start of the next iteration. In other words, each iteration $j \le \bar{j}$ is started by a “Restart” (in the very first one it acts as the wake-up); this costs exactly $n$ signals. As part of the first waiting, “Wait1” messages will be sent, for a total of $n$ signals. In the worst case there will also be a second waiting with “Wait2” messages, causing no more than $n$ signals. Hence, each iteration except the last will cost at most $3n$ signals. The last iteration also has a “Terminate” message costing exactly $n$ signals. Hence, the total number of bits transmitted by DoubleWait will be at most

$$B[\text{DoubleWait}] = 3\,c\,n\,\bar{j} + c\,n = 3\,c\,n\,\lceil g^{-1}(n) \rceil + c\,n,$$

where $c = O(1)$ is the number of bits necessary to distinguish between the “Restart,” “Wait1,” “Wait2,” and “Terminate” messages.

Time

Consider now the time costs of DoubleWait. Obviously, the time complexity of an iteration is directly affected by the values of the waiting functions $f$ and $h$, which are in turn affected by the value $g(j)$ they must necessarily use in their definition. The overall time complexity is also affected by the number of iterations $\bar{j} = \lceil g^{-1}(n) \rceil$, which depends on the choice of the function $g$.

Let us first of all choose the waiting functions $f$ and $h$. The ones we select are

$$f(\mathrm{id}(x), j) = 2\,g(j)\,\mathrm{id}(x), \tag{6.40}$$

which is the standard waiting function when the entities do not start at the same time and where $g(j)$ is used instead of $n$; and

$$h(\mathrm{id}(x), j) = 2\,g(j)\,\mathrm{id}(x) + g(j) - n_x(j), \tag{6.41}$$

which is the one that, as we have already seen, makes “fooling” impossible.

With these choices made, we can determine the amount of time the protocol uses until termination. In fact, it is immediate to verify (Exercise 6.6.53) that the number of time units till termination is less than

$$T[\text{DoubleWait}] = 2(n - 1) + (4\,i_{\min} + 2) \sum_{j=1}^{\bar{j}} g(j).$$

Again, this quantity depends solely on the choice of the guessing function g.
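Both cost expressions can be evaluated numerically for a given guessing function. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the number of iterations is found by searching for the smallest positive $j$ with $g(j) \ge n$, and $c = 2$ is assumed because two bits suffice to tell four message types apart.

```python
def iterations(g, n):
    """Smallest positive j with g(j) >= n (the number of iterations)."""
    j = 1
    while g(j) < n:
        j += 1
    return j

def doublewait_bits(g, n, c=2):
    """Bit bound: 3*c*n signals per iteration plus c*n for the final Terminate."""
    return 3 * c * n * iterations(g, n) + c * n

def doublewait_time(g, n, i_min):
    """Time bound: 2(n - 1) + (4*i_min + 2) * (g(1) + ... + g(last iteration))."""
    last = iterations(g, n)
    return 2 * (n - 1) + (4 * i_min + 2) * sum(g(j) for j in range(1, last + 1))
```

For example, with the illustrative choice `g = lambda j: 2 ** j`, a ring of $n = 10$ entities needs 4 iterations (guesses 2, 4, 8, 16).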

Trade-offs: Choosing the Guessing Function

The results we have obtained for the number of bits and the amount of time are expressed in terms of the guessing function $g$. This is the only parameter we have not yet chosen. Before we proceed, let us examine what the impact of such a choice is.

The protocol terminates as soon as $g(j) \ge n$, that is, after $\bar{j} = \lceil g^{-1}(n) \rceil$ iterations. If we have a fast-growing function $g$, this will happen rather quickly, requiring few iterations. For example, if we choose $g(j) = 2\,g(j - 1)$ (i.e., we double every time), then $\bar{j} = \lceil \log n \rceil$; we could choose something faster, say $g(j) = g(j - 1)^2$ (i.e., we square every time), obtaining $\bar{j} = \lceil \log \log n \rceil$, or $g(j) = 2^{g(j - 1)}$ (i.e., we exponentiate every time), obtaining $\bar{j} = \log^* n$, where $\log^*$ denotes the number of times you must take a log before the value becomes 1. So it would seem that to reduce the bit complexity, we need $g$ to grow as fast as possible.

By contrast, the value $g(j)$ is a factor in the time complexity. In particular, the larger $g(j)$ is, the more we have to wait. To understand how bad this impact can be, consider just the very last iteration $\bar{j}$ and assume that we just missed $n$, that is, $g(\bar{j} - 1) = n - 1$. In this last iteration we wait for roughly $4\,\mathrm{id}(x)\,g(\bar{j}) = 4\,\mathrm{id}(x)\,g(\lceil g^{-1}(n) \rceil)$ time units.


| $g(j)$ | Bits | Time |
|---|---|---|
| $g(j) = 2\,g(j - 1)$ | $O(n \log n)$ | $O(n\, i)$ |
| $g(j) = g(j - 1)^2$ | $O(n \log \log n)$ | $O(n^2\, i)$ |
| $g(j) = 2^{g(j - 1)}$ | $O(n \log^* n)$ | $O(2^n\, i)$ |

FIGURE 6.27: Some of the trade-offs offered by the choice of g in DoubleWait.
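The tension between the number of iterations and the size of the last guess can be observed numerically. The sketch below is only an illustration: the function name is hypothetical and the starting guess $g(1) = 2$ is an assumption, since the recursive definitions above leave the first value open.

```python
def run_guesses(step, n, first=2):
    """Iterate g(1) = first, g(j) = step(g(j - 1)) until g(j) >= n.

    Returns (number of iterations, last guess).
    """
    guess, j = first, 1
    while guess < n:
        guess, j = step(guess), j + 1
    return j, guess

# The three growth rules discussed in the text.
GROWTH = {
    "double": lambda v: 2 * v,
    "square": lambda v: v * v,
    "exponentiate": lambda v: 2 ** v,
}
```

For $n = 1000$, doubling takes 10 iterations and stops at 1024, whereas squaring takes 5 iterations and exponentiating 4, both stopping at 65536: fewer iterations (and bits), but a much larger last guess to wait on.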

This does not appear to be too bad; after all, $g(g^{-1}(n)) = n$. How much bigger than $n$ can $g(\lceil g^{-1}(n) \rceil)$ be? It depends on how fast $g$ grows. If we choose $g(j) = 2\,g(j - 1)$, then $g(\lceil g^{-1}(n) \rceil) = 2(n - 1)$. However, if we choose $g(j) = g(j - 1)^2$, then we have $g(\lceil g^{-1}(n) \rceil) = (n - 1)^2$, and the choice $g(j) = 2^{g(j - 1)}$ would give us $g(\lceil g^{-1}(n) \rceil) = 2^{n - 1}$. Thus clearly, from the time-complexity point of view, we want a function $g$ that does not grow very fast at all.

To help us in the decision process, let us restrict ourselves to a class of functions. A function $g$ is called superincreasing if for all $j > 1$

$$g(j) \ge \sum_{s=1}^{j-1} g(s). \tag{6.42}$$

If we restrict ourselves to superincreasing functions, then the bit and time costs of DoubleWait become (Exercise 6.6.54)

$$B[\text{DoubleWait}] \le 3\,c\,n\,\lceil g^{-1}(n) \rceil + c\,n \tag{6.43}$$

$$T[\text{DoubleWait}] \le 2(n - 1) + (8\,i_{\min} + 2)\, g(\lceil g^{-1}(n) \rceil). \tag{6.44}$$
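The superincreasing condition of Equation 6.42 is easy to test for a concrete guessing function over a finite range. The sketch below uses a hypothetical function name and checks the condition only up to a given index, so it is a sanity check and not a proof.

```python
def is_superincreasing(g, upto):
    """Check g(j) >= g(1) + ... + g(j - 1) for every j with 1 < j <= upto."""
    prefix = g(1)                      # running sum g(1) + ... + g(j - 1)
    for j in range(2, upto + 1):
        if g(j) < prefix:
            return False
        prefix += g(j)
    return True
```

For a superincreasing $g$, the sum of all guesses up to index $j$ is at most $2\,g(j)$, because the earlier guesses together do not exceed the last one; this is why the sum over all iterations in the time expression can be replaced by a single term in the last guess.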

These bounds show the existence and the nature of the trade-off between time and bits. Some interesting choices are shown in Figure 6.27.

Examining the trade-off, we discover two important features of protocol DoubleWait:

1. the bit complexity is always independent of the entities' values and, thus, bounded;
2. the time complexity is always linear in the smallest entity value.

Comparing the cost of DoubleWait with the cost of the other ring election protocols that do not require knowledge of (an upperbound on) $n$, it is clear that DoubleWait outperforms Speed, which has an unbounded bit complexity and a time complexity exponential in the input values. As for SynchStages, notice that by choosing $g(j) = 2\,g(j - 1)$, DoubleWait has the same bit costs but a better time complexity (see Figure 6.28); with a different choice of $g$, it is possible to have the same time as SynchStages but with a smaller bit complexity (Exercise 6.6.55).

| Protocol | Bits | Time | Notes |
|---|---|---|---|
| Speed | $O(n \log i)$ | $O(n\, 2^{i})$ | |
| SynchStages | $O(n \log n)$ | $O(n \log n\; i)$ | |
| DoubleWait | $O(n\, g^{-1}(n))$ | $O(g(g^{-1}(n))\, i)$ | |
| Wait | $O(n)$ | $O(n\, i)$ | $n$ known |
| Guess | $O(kn)$ | $O(k\, n\, i^{1/k})$ | $n$ known |
| Symmetry | $O(n)$ | $O(n)$ | $n$ known; randomized |

FIGURE 6.28: Summary of Election techniques for synchronous rings.

Notice that the bit complexity can be asymptotically reduced to $O(n)$, matching the one obtained by the protocols Wait and Guess, which assume knowledge of an upperbound on $n$; clearly this is achieved at the expense of an exorbitant time complexity. An exact $O(n)$ bit complexity with a reasonable time can, however, be achieved without knowing $n$ using DoubleWait in conjunction with other techniques (Problem 6.6.9).

6.4 SYNCHRONIZATION PROBLEMS: RESET, UNISON, AND FIRING SQUAD

A fully synchronous system is by definition highly synchronized, so it might appear strange to talk about the need for synchronization in the system and the computational problems related to it. Regardless of the oddity, the need and the problems exist and are quite important.

There is first of all a synchronization problem related to the local clocks themselves. We know that in a synchronous environment all local clocks tick at the same time; however, they might not show the same value. A synchronous system is said to be in unison if indeed all the clock values are the same. Notice that once a system is in unison, it will remain so unless the values of some clocks are locally altered. The unison problem is how to achieve such a state, possibly with several independent initiators.

Then there are two synchronization problems related to the computational states of the entities. The first of them we have already seen, the wake-up or reset problem: All entities must enter a special state (e.g., awake); the process can be started by any number of entities independently. Notice that in this specification there is no mention of when an individual entity must enter such a state; in fact, in the solutions we have seen, entities become awake at different times.

The second is the firing squad problem: here too all entities must enter a special state (usually called firing), but they must do so at the same time and for the first time. Firing squad synchronization is obviously stronger than reset. It is also stronger than unison: With unison, all entities arrive at a point where they are operating with the same clock value, and thus, in a sense, they are in the same “state” at the same time; however, the entities do not necessarily know when.


We are going to consider all three problems and examine their nature and interplay in some detail. All of them will be considered under the standard set of restrictions R plus, obviously, Synch.

6.4.1 Reset/Wake-Up

In reset, all entities must enter the same state within finite time. One important application of reset is when a distributed protocol is only initiated by a subset of the entities in the system, and we need all entities in the system to eventually begin executing the protocol. When reset is applied at the first step of a protocol, it is called wake-up. The wake-up or reset problem is a fundamental problem, and we have examined it extensively in asynchronous systems.

In fully synchronous systems it is sometimes also called weak unison; its solution is usually a preliminary step in larger computations (e.g., Wait, Guess, DoubleWait), and it is mostly used to keep the initiation times of the main computation bounded. For example, in protocol Wait applied to a network $G$ (not necessarily a ring) of known diameter $d$, the initial wake-up ensures that all entities become awake within $d$ time units from the start.

For computations that use wake-up as a tool, their cost obviously depends on the cost of the wake-up. Consider, for example, electing a leader in a complete graph $K_n$ using the waiting technique. Not counting the wake-up, the election will cost only $n - 1$ bits, and it can be done in $4\,i_{\min} + 1$ time units (see Equation 6.29); recall that in a complete graph, $d = 1$. Also, the wake-up can be done fast, in 1 time unit, but this can cost $O(n^2)$ bits. In other words, the dominant bit cost in the entire election protocol is that of the wake-up, and it is unbearably high. Sometimes it is desirable to obtain wake-up protocols that are slower but use fewer transmissions.

In the rest of this section we will concentrate on the problem of wake-up in a complete network.
The difficulty of waking up in asynchronous complete networks, which we discussed in Section 2.2, does not disappear in synchronous complete networks. In fact, in complete networks where the port numbers are arbitrary, Ω(n^2) signals must be sent in the worst case.

Theorem 6.4.1 In a synchronous complete network with arbitrary labeling, wake-up requires Ω(n^2) messages in the worst case.

To see why this is true, consider any wake-up protocol W that works for any complete network regardless of the labeling. By contradiction, let W use o(n^2) signals in every complete network of size n. We will first consider a complete network K_n^1 with chordal labeling: A Hamiltonian cycle is identified, and a link (x, y) is labeled at x with the distance from x to y along that cycle. The links incident on x will, thus, be labeled 1, 2, . . . , n − 1. On this network, we will consider the following execution:

E1: Every entity starts the wake-up simultaneously.
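The symmetry of the chordal labeling can be checked on a small instance. The following is a minimal Python sketch, not part of the original text: it assumes the entities are numbered 0, ..., n − 1 along the Hamiltonian cycle, and the helper names chordal_port, neighbor_via, and check_chordal_symmetry are hypothetical. It verifies that every entity sees exactly the labels 1, ..., n − 1 and that a signal sent on port j always arrives on port n − j.

```python
def chordal_port(x, y, n):
    """Label at x of link (x, y): distance from x to y along the cycle 0, 1, ..., n-1."""
    return (y - x) % n


def neighbor_via(x, port, n):
    """Entity reached from x through the link labeled `port` at x (assumed numbering)."""
    return (x + port) % n


def check_chordal_symmetry(n):
    """Return True if the labeling has local orientation and the j / n-j symmetry."""
    for x in range(n):
        labels = sorted(chordal_port(x, y, n) for y in range(n) if y != x)
        if labels != list(range(1, n)):          # ports at x are exactly 1..n-1
            return False
        for j in range(1, n):
            y = neighbor_via(x, j, n)
            if chordal_port(y, x, n) != n - j:   # sent on port j, received on port n-j
                return False
    return True


print(check_chordal_symmetry(8))
```

Because the check succeeds for every entity, an execution in which all entities start simultaneously looks identical from each entity's point of view, which is the property the argument uses.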


Concentrate on an entity x; let L(x) be the set of port numbers on which a message was sent or received by x during this execution. Observe that because all entities start at the same time and because of the symmetry of the labeling, L(x) = L(y) for all entities x and y. In fact, if x sends a signal via port number j, so will everybody else, and all of them will receive it from port number n − j. As protocol W is correct, within a finite number t of time units, all the entities terminate. As, by assumption, every execution uses only o(n^2) signals, |L(x)| = l = o(n).

We now construct a complete network K_n^2 with a different labeling. In this network, we select l + 1 entities x_0, x_1, . . . , x_l, and label the links between them with an “almost chordal” labeling using the labels in L(x). All other links in the network are labeled arbitrarily without violating local orientation (this can always be done: Exercises 6.6.57 and 6.6.58). In this network consider the following execution:

E2: Only the selected entities will start and will do so simultaneously.

In this execution only a few (|L(x)| + 1 = o(n)) entities start. From the point of view of these initiators, everything in this execution happens exactly as if they were in the other execution in the other network: Messages will be sent and received exactly from the same ports in the same steps in both executions. In particular, none of them will send a signal outside its “little clique.” Hence, none of the other nodes will receive any signal; as those entities did not wake up spontaneously, this means that none of them will wake up at all. In particular, none of them will send any signal to the initiators; hence no initiator will receive a signal from outside the “little clique.” Therefore, the initiators will act as if they are in K_n^1 and the execution is E1; thus, at time t the initiators will all terminate the execution of the protocol.
However, the majority of the nodes are not awake, nor will they ever become awake, contradicting the correctness of the protocol. In other words, there is no correct wake-up protocol for complete networks that always uses o(n^2) transmissions.

Summarizing, regardless of the protocol and the techniques (e.g., communicators, pipeline, waiting, guessing), and regardless of the fact that we can use time as a computational tool, wake-up will cost Ω(n^2) signals in the worst case.

6.4.2 Unison

A synchronous system is said to be in unison if all the clock values are the same. The unison problem is how to achieve such a state, possibly with several independent initiators. Notice that once a system is in unison, it will remain so unless the values of some clocks are locally altered.

Let us examine a very simple protocol for achieving unison. Each entity will execute a sequence of stages, each taking one unit of time, starting either spontaneously or upon receiving a message from another entity.

Protocol MaxWave:
1. An initiator x starts by sending to all its neighbors the value of its local clock c_x.
2. A noninitiator y starts upon receiving messages from neighbors: It increases the received values by one time unit, computes the largest among these values and its own clock value, resets its clock to that maximum, and sends it to all its neighbors.
3. In stage j > 1, an entity (initiator or not) checks the clock values it receives from its neighbors and increases each one of them by one time unit; it then compares these values with each other as well as with its own. If the value of the local clock is maximum, no message is sent; else, the local clock is set to the largest of all values, and this value is sent to all the neighbors (that sent a smaller value).

Consider the largest value t_max among the local clocks when the protocol starts. It is not difficult to see that this value (increased by one unit at each instant of time) reaches every entity, and every entity will set its local clock to that time value (Exercise 6.6.59). In other words, with this simple protocol, which we shall call MaxWave, the entities are guaranteed to operate in unison within finite time.

Let us discuss how long this process takes. Unison happens as soon as every entity whose initial clock value was smaller than t_max receives t_max (properly incremented). In the worst case, only one entity z has t_max at the beginning, and this entity is the last one to start. This value (properly incremented) has to reach every other entity in the network; this propagation will require at most a number of time units equal to the diameter d of the network; as z will start at most d time units after the first entity, this means that the system operates in unison after at most 2d time units from the start.

How can an entity detect termination? How does it know whether the system is now operating in unison? Necessarily, an entity must know d (or an upper bound on d, e.g., n) to be able to know when the protocol is over.
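The 2d bound can be explored with a small round-based simulation. The sketch below is only an illustration under stated assumptions, not the book's code: the function name max_wave and its parameters (adj, clocks, wake, horizon) are hypothetical, messages sent at one time unit are assumed to arrive at the next, and, for simplicity, an entity that must transmit sends to all its neighbors rather than only to those that sent a smaller value, so the message count is an upper bound on that of the protocol.

```python
def max_wave(adj, clocks, wake, horizon):
    """Simulate MaxWave round by round (illustrative sketch).

    adj[x]   : list of neighbors of entity x
    clocks[x]: local clock of x at global time 0
    wake[x]  : global time at which x starts spontaneously, or None
    Returns (first global time at which all clocks agree, messages sent).
    """
    n = len(adj)
    clock = list(clocks)
    started = [False] * n
    inbox = [[] for _ in range(n)]
    messages = 0
    unison_time = None
    for now in range(horizon + 1):
        outbox = [[] for _ in range(n)]
        for x in range(n):
            # Each received value is increased by the one time unit spent in transit.
            received = [v + 1 for v in inbox[x]]
            spontaneous = (not started[x]) and wake[x] == now
            if not received and not spontaneous:
                continue
            best = max(received + [clock[x]])
            # Send when starting, or when the local clock was not the maximum.
            must_send = (not started[x]) or best > clock[x]
            clock[x] = best
            started[x] = True
            if must_send:
                for y in adj[x]:
                    outbox[y].append(best)
                    messages += 1
        if unison_time is None and len(set(clock)) == 1:
            unison_time = now
        clock = [c + 1 for c in clock]   # every local clock ticks once per time unit
        inbox = outbox
    return unison_time, messages


# A path of 5 entities (d = 4, m = 4); only entity 0 is an initiator,
# and the largest clock value sits at the far end of the path.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
d, m = 4, 4
t_unison, sent = max_wave(path, [3, 10, 0, 7, 25], [0, None, None, None, None], horizon=3 * d)
print(t_unison, sent)
```

The instance is chosen to match the worst case described above: the entity holding the largest value is the last one to start, so its value must travel back across the whole path after the wake-up wave has reached it.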
The amount 2d is measured from the (global) time t at which the first entities started the execution of the protocol. An entity x starts participating at some (global) time t(x) ≥ t. Thus, assuming that (an upper bound on) d is known a priori to all entities, at time t(x) + 2d entity x knows for sure that the system is operating in unison (this time can actually be reduced; see Exercise 6.6.60). In other words, entities may terminate at different times; their termination will, however, be within at most d time units from each other.

What is the number of messages that will be transmitted? A very rough overestimate is easily obtained by assuming that each entity x transmits to all its |N(x)| neighbors in each of the 2d time units; since the neighborhood sizes sum to 2m, this gives

\[ \sum_{x} 2d\,|N(x)| = 2d \sum_{x} |N(x)| = 4\,d\,m. \]

This is a gross overestimate. In fact, once an entity receives the max time, it will transmit only in this step and no more. So the entities with the largest value will transmit to their neighbors only once; their neighbors will transmit only twice; in general, the entities at distance j from the entities with the largest value will transmit only j + 1 times. We also know that an entity does not send the max time to those neighbors from which it received it. The actual cost depends on the topology of the network and the actual initiation times. For some networks, the cost is not difficult to determine (Exercises 6.6.61 and 6.6.62). Assuming that we are operating not on an arbitrary graph but on a tree (e.g., a previously constructed spanning tree of the network), we immediately have m = n − 1, and we can make accurate measurements (Exercise 6.6.63).

In all this discussion, we have made an implicit assumption that the clock values we are sending are bounded and fit inside a message. However, time and thus the clock values are unbounded. In fact, clock values increase at each time unit; in our protocol, the transmitted values were increased at each time unit and the largest was propagated. Therefore, the solution we have described is not feasible.

To ensure that the values are bounded, we concentrate on the definition of the problem: Our goal is to achieve unison, that is, we want all local clocks to show the same value. Notice that the definition does not care what that value is, only that it is the same for all entities. Armed with this understanding, we make a very simple modification to the MaxWave protocol: When an entity starts MaxWave, it first resets its local clock to 0. In this way, the maximum value transmitted is at most 2d (Exercise 6.6.64), which is bounded.

6.4.3 Firing Squad

Firing squad synchronization is a problem stricter than unison. It requires that all entities enter a predefined special state, firing, for the first time simultaneously. More precisely, all the entities are initially in state active, and each active entity can at any time spontaneously become excited. The goal is to coordinate the entities so that, within finite time from the time the first entity becomes excited, all entities become firing simultaneously and for the first time.
In its original form, the problem was described for synchronous cellular automata (i.e., computational entities with O(1) memory) placed in a line of unknown length n, and where the leftmost entity in the line is the sole initiator, known as the “general”. Note that as cellular automata only have a constant memory size, they cannot represent (nor count up to) nonconstant values such as n or d.

We are interested in solving this problem in our setting, where the entities have at least O(log n) bits of local memory, and thus they can count up to n. Again we are looking for a protocol that can work in any network; observe that the entities need to know or to compute (an upper bound on) d to terminate.

If the network is a tree, or we have available a spanning tree of the network, then a simple efficient solution exists, on the basis of saturation (Exercise 6.6.68). This protocol uses at most 3n − 2 signals and n − 2 messages each containing a value of at most d, for a total of O(n log n) bits; the time is at most 3d − 3. The bit complexity can be reduced to O(n) still using only O(n) time (Exercise 6.6.69). That is, firing squad can be solved in networks with an available spanning tree in optimal time and bits.

What happens if there is no spanning tree available? Even worse, what happens if no spanning tree is constructible (e.g., in an anonymous network)? The problem can still be solved. To do so, let us explore the relationship between firing squad and unison.

First observe that as all entities become firing simultaneously, if each entity resets its local clock when it becomes firing, all local clocks will have the same value 0 at the same time. In other words, any solution to the firing squad problem will also solve the unison problem. The converse is not necessarily true. In unison, all the local clocks will at some point show the same value; however, the entities might not know exactly when this happens. They might become aware (i.e., terminate) at different times; but for firing squad synchronization we need them to make a decision simultaneously, that is, with no difference in time.

Surprisingly, protocol MaxWave actually solves the firing squad problem in networks where no spanning tree is available. To see why this is true, consider the modification we made to ensure that the transmitted values are bounded: When an entity starts the protocol, it first resets its local clock to 0. Let t be the global time when the protocol starts, that is, t is the time when the first entities reset their clocks to 0. We will call such entities “initiators.” Two simple observations (Exercises 6.6.70 and 6.6.71):

Property 6.4.1
1. If a message originated by an initiator reaches entity y at time t + w, then the value of that message (incremented by 1) is exactly w.
2. Regardless of whether y has already independently started or starts now, the current value of its local clock will be smaller than w; thus, y will set its clock in unison with the clocks of the initiators.
Summarizing, every noninitiator receives a message from the initiators, and as soon as an entity receives a message originated by the initiators (i.e., carrying the max reset time), it will be in unison with the initiators. Thus, an entity x is in unison with the initiators at time t + d(x, I), where d(x, I) denotes the distance between x and the closest initiator. As d(x, I) ≤ d, this means that all clocks will be in unison after at most d time units from the start.

Once the clocks are in unison, unless someone resets them, they keep on being in unison. As nobody is resetting the clocks again, this means that all entities will be in unison at time t + d. The value of the clocks at that time is exactly d. This means that when the reset local clock shows time d, the entity knows that indeed the entire system is in unison; if the entity enters state firing at this time, it is guaranteed that all other entities will do the same simultaneously, and for the first time, solving the firing squad problem.

Summarizing, protocol MaxWave solves the firing squad problem in d time units:

\[ T[\text{MaxWave}] = d, \tag{6.45} \]

and this is worst-case optimal. The number of messages is less than 2 d m and each contains at most log d bits, that is,

\[ B[\text{MaxWave}] < 2\,m\,d \log d. \tag{6.46} \]
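The firing rule can be illustrated with a round-based simulation of MaxWave with the clock reset. This is a hedged sketch under stated assumptions rather than the book's protocol text: the function name firing_squad and its parameters (adj, wake, d, horizon) are hypothetical, messages are assumed to take exactly one time unit, and an entity that must transmit sends to all its neighbors. Each entity resets its clock to 0 when it starts and enters state firing when its clock shows d.

```python
def firing_squad(adj, wake, d, horizon):
    """Simulate MaxWave with clock reset; return the global firing time of each entity.

    adj[x] : list of neighbors of entity x
    wake   : dict mapping a spontaneous initiator to its global start time
    d      : (known) diameter of the network
    """
    n = len(adj)
    clock = [None] * n            # None means the entity is still asleep
    fire_time = [None] * n
    inbox = [[] for _ in range(n)]
    for now in range(horizon + 1):
        outbox = [[] for _ in range(n)]
        for x in range(n):
            if fire_time[x] is not None:
                continue          # already fired; ignores further messages
            received = [v + 1 for v in inbox[x]]   # one time unit spent in transit
            starting = clock[x] is None and (bool(received) or wake.get(x) == now)
            if starting:
                clock[x] = 0      # reset the local clock when starting
            if clock[x] is None:
                continue
            best = max(received + [clock[x]])
            if starting or best > clock[x]:
                clock[x] = best
                for y in adj[x]:
                    outbox[y].append(best)
            if clock[x] == d:     # the reset clock shows d: enter state firing
                fire_time[x] = now
        clock = [None if c is None else c + 1 for c in clock]
        inbox = outbox
    return fire_time


# Path of 5 entities (d = 4); entity 0 starts at time 2 and entity 4 at time 3.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
wake = {0: 2, 4: 3}
fire = firing_squad(path, wake, d=4, horizon=12)
print(fire)
```

In this instance the first initiator starts at time t = 2, and the late initiator (entity 4) is pulled into unison by the wave originated at entity 0 before any clock shows d, so all entities fire together at time t + d.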

The bit complexity can be reduced at the expense of time, by using communicators to communicate the content of the messages (Exercises 6.6.66 and 6.6.67).

6.5 BIBLIOGRAPHICAL NOTES

Some of the work on synchronous computing was done very early on in the context of Cellular Automata and Systolic Arrays; in particular, pipeline is a common computational tool in VLSI systems (which include systolic arrays).

In the framework of distributed computing, the first important result on (fault-free) synchronous computations is protocol Speed designed by Greg Frederickson and Nancy Lynch [9], and independently by Paul Vitanyi [26] (whose version of the protocol actually works with a weaker form of full synchrony, called Archimedean Time Assumption or ATA). This result has alerted algorithmic researchers to the existence of the field. Some of the first improvements were due to Eli Gafni [11] and Alberto Marchetti-Spaccamela [17], who reduced the time but still kept the unbounded bit complexity. Subsequent improvements to bounded bit complexity and to reduced time costs were obtained by using (and combining) communicators, waiting, and guessing.

Communicators have been used for a while. The so-called “one-bit” protocol (e.g., see Problem 6.6.1) was originally proposed and used by Hagit Attiya, Marc Snir, and Manfred Warmuth [3] and later rediscovered by Amotz Bar-Noy, Joseph Naor, and Moni Naor [4]. The size communicator is due to Bernd Schmeltz [24]. C2 is “folk” knowledge, while C3 is due to Paul Vitanyi [unpublished]. The optimal k-communicators have been designed by Una-May O’Reilly and Nicola Santoro [20].

The first combined use of communicators and pipeline is due to B. Schmeltz [24]. The computations in trees using pipeline are due to Paola Flocchini [8]. The asynchronous-to-synchronous transform is due to Una-May O’Reilly and Nicola Santoro [19].
The waiting technique was independently discovered by Eli Gafni [11], who used it to reduce the time costs of Speed, and by Nicola Santoro and Doron Rotem [23], who designed protocol Wait. Protocol Guess has been designed by Jan van Leeuwen, Nicola Santoro, Jorge Urrutia, and Shmuel Zaks [16]. Double Waiting is due to Mark Overmars and Nicola Santoro [21].


The first bit-optimal election protocol for rings is due to Hans Bodlaender and Gerard Tel [5]; it does, however, require exponential time. The time has been subsequently drastically reduced (Problem 6.6.9) without increasing the bit complexity by Mark Overmars and Nicola Santoro [21].

The problem of symmetry breaking was first studied for rings by Alon Itai and Michael Rodeh [14] and for other networks by Doron Rotem and Nicola Santoro [23]. The simpler and more efficient protocol Symmetry has been designed by Greg Frederickson and Nicola Santoro [10]. These results have been extended to environments with ATA-synchrony by Paul Spirakis and Basil Tampakas [25].

The maximum-finding protocol for rings of Problem 6.6.7 has been designed by Paola Alimonti, Paola Flocchini, and Nicola Santoro [1].

The trade-offs for wake-up in complete graphs with chordal labeling are due to Amos Israeli, Evangelos Kranakis, Danny Krizanc, and Nicola Santoro [13].

The unison problem was first studied (in a slightly different context) by Shimon Even and Sergio Rajsbaum [6, 7], and in the context of self-stabilization by Mohamed Gouda and Ted Herman [12]. Bounding the message size was studied by Anish Arora, Shlomi Dolev, and Mohamed Gouda [2], again in the context of self-stabilization.

The firing squad problem was originally proposed for Cellular Automata by J. Myhill and reported by E. Moore [18]. In our context, the problem was first studied for synchronous trees by Raul Ramirez and Nicola Santoro [22]; the optimal solution has been designed by Ephraim Korach, Doron Rotem, and Nicola Santoro [15]. The universal protocol MaxWave is a simple extension of existing unison solutions.

6.6 EXERCISES, PROBLEMS, AND ANSWERS

6.6.1 Exercises

Exercise 6.6.1 Determine the number of messages of protocol Speed if the waiting function is f(v) = c^v, for an integer c > 2.

Exercise 6.6.2 Determine the number of messages of protocol Speed if the waiting function is f(v) = v^c, for an integer c > 1.

Exercise 6.6.3 Modify protocol Speed so that even if the entities do not start simultaneously, a leader is elected with O(n) messages.

Exercise 6.6.4 Prove that Protocol Speed requires 2^i n time units.

Exercise 6.6.5 Modify protocol C2 so that it communicates any integer i, positive or negative, transmitting 2 bits and O(|i|) time units.

Exercise 6.6.6 Construct a protocol R2 that communicates any positive integer I transmitting 2 bits and only 2 + I/4 time units.


Exercise 6.6.7 Consider protocol TwoBits when each packet contains c > 1 bits. Use the content of the packets to convey information about the value i to be communicated. Determine the time costs that can be achieved.

Exercise 6.6.8 Construct a protocol R3 that communicates any positive integer I transmitting 3 bits and only √I + 3 time units.

Exercise 6.6.9 Consider a system where packets contain c > 1 bits. Modify protocol R3 using the content of the packets so as to reduce the time costs. Determine the amount of savings that can be achieved.

Exercise 6.6.10 Prove that the communicator described in Section 6.2.1 uses at most O(i^{1/k}) time units.

Exercise 6.6.11 Use the content of the transmitted bits so as to reduce the time costs of the communicator described in Section 6.2.1. Show how a time cost of at most (k − 1)(I/4)^{1/(k−1)} + k clock ticks can be achieved.

Exercise 6.6.12 Prove that communicator Order_k uses f(I, k) + k + 1 time to communicate I, where f(I, k) is the smallest integer t such that \( I \le \binom{t+k}{k} \).

Exercise 6.6.13 Prove that communicator Order+_k uses g(I, k) + k + 1 time to communicate I, where g(I, k) is the smallest integer t such that \( I \le 2^{k+1} \binom{t+k}{k} \).

Exercise 6.6.14 Prove that \( \omega(t, k) = \binom{t+q}{q} \).

Exercise 6.6.15 Prove that any protocol using k + 1 corruptible bits to communicate values from U requires

f |U |, k

2

|U | −

i

0≤i 2; thus, they cannot be used in pipeline for computing the minimum. Determine a class MonotoneOrder_k of optimal corruption-tolerant communicators that are monotonically increasing.

Exercise 6.6.22 Communicators Order+_k are optimal but not monotonically increasing for k > 2; thus, they cannot be used in pipeline for computing the minimum. Determine a class MonotoneOrder+_k of optimal communicators that are monotonically increasing.

Exercise 6.6.23 Write a protocol for finding the largest value in a chain using the 2-bit communicator and pipeline. Prove its correctness.

Exercise 6.6.24 Minimum-Finding in Pipeline. Write a protocol for finding the smallest value in a chain using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Exercise 6.6.25 Sum-Finding in Pipeline. Write a protocol for finding the sum of all the values in a chain using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Exercise 6.6.26 Protocol SynchStages is the transformation of Stages using communicator TwoBits. Add pipeline to this protocol to convey information from a candidate to a neighboring one. Prove its correctness. Analyze its costs; in particular, determine the reduction in time with respect to the nonpipelined version.

Exercise 6.6.27 Modify protocol Wait so that it finds the minimum value only among the initiators.


Exercise 6.6.28 Determine the smallest waiting function that allows protocol Wait to work correctly without simultaneous initiation: (a) in a unidirectional ring; (b) in a bidirectional ring.

Exercise 6.6.29 Determine the smallest waiting function that allows protocol Wait to work correctly with simultaneous initiation: (1) in an a × b mesh; (2) in an a × b torus; (3) in a k-dimensional hypercube; (4) in a complete network.

Exercise 6.6.30 Determine the smallest waiting function that allows protocol Wait to work correctly without simultaneous initiation: (1) in an a × b mesh; (2) in an a × b torus; (3) in a k-dimensional hypercube.

Exercise 6.6.31 Prove that protocol Wait would work even if a quantity n' ≥ n is used instead of n.

Exercise 6.6.32 Determine under what conditions protocol Wait would work if a quantity n > n is used instead of n in the waiting function.

Exercise 6.6.33 Assuming distinct initial values, characterize what would happen to protocol Wait in a ring network if each entity x uses 2 id(x)^2 as its waiting function. In particular, determine under what conditions the protocol would certainly work.

Exercise 6.6.34 Under the conditions of Exercise 6.6.33, show how all the entities can efficiently detect whether the protocol does not work.

Exercise 6.6.35 Determine the cost of computing the AND of all input values in a synchronous ring of known size n using protocol Waiting.

Exercise 6.6.36 Describe how to efficiently use protocol Wait to compute the OR of the input values in a synchronous ring of known size n. Determine its cost.

Exercise 6.6.37 Modify protocol Symmetry so that it works efficiently in a bidirectional square torus of known dimension. Determine its exact costs.

Exercise 6.6.38 Modify protocol Symmetry so that it works efficiently in a unidirectional square torus of known dimension. Determine its costs.

Exercise 6.6.39 Prove that with simultaneous initiation, protocol Symmetry can be modified so as to work correctly in every network of known girth. (Hint: Use the girth instead of n in the waiting function.)

Exercise 6.6.40 Determine the complexity of protocol Symmetry if we use in random selection criteria b = n and choose each value with the same probability 1/n.


Exercise 6.6.41 Modify protocol Decide so as to compute the OR of the input values in a synchronous ring of known size n. Prove its correctness and determine its cost.

Exercise 6.6.42 Write protocol Guess and implement it; thoroughly test your implementation.

Exercise 6.6.43 Show how to find i_min with k overestimates using q = k M^{1/k} questions.

Exercise 6.6.44 Show how we can always win the guessing game in an interval of size 2^q − 1 with q questions if they are all allowed to be overestimates.

Exercise 6.6.45 Show how to obtain a unique solution to the recurrence relation of expression 6.33.

Exercise 6.6.46 Determine a function g to bound i_min so that the total time for finding with k overestimates is at most 2 h(i_min, k) − 1.

Exercise 6.6.47 Modify subprotocol Decide(p) so that it will work in every network, regardless of its topology. Assume that an upper bound on the diameter of the network is known a priori. Prove its correctness.

Exercise 6.6.48 Modify subprotocol Decide(p) so that protocol Guess works correctly even if the entities do not start simultaneously.

Exercise 6.6.49 Prove that, in DoubleWait, if x is being “fooled,” then both the “Wait1” and the “Wait2” message it receives are sent by the same entity.

Exercise 6.6.50 Let the entities start the j-th iteration of DoubleWait within n − 1 time units from each other. Prove that the entity with the smallest value becomes leader and all others will become defeated in that iteration.

Exercise 6.6.51 Let the entities start the j-th iteration of DoubleWait within n − 1 time units from each other. Prove that if an entity x becomes leader in this iteration, then g(j) ≥ n > g(j − 1).

Exercise 6.6.52 Let the entities start the j-th iteration of DoubleWait within n − 1 time units from each other. Prove that if g(j) < n, then all entities start the (j + 1)-th iteration within n − 1 time units from each other.

Exercise 6.6.53 Prove that the time used by protocol DoubleWait, with the choices of f and h specified by Expressions 6.40 and 6.41, is at most 2(n − 1) + (4 imin + 2) j j =1 g(j ).


Exercise 6.6.54 Consider protocol DoubleWait, where f and h are as in Expressions 6.40 and 6.41, and g is superincreasing. Prove that the time is at most 2(n − 1) + (8 i_min + 2) g(g^{-1}(n)).

Exercise 6.6.55 Consider protocol DoubleWait, where f and h are as in Expressions 6.40 and 6.41. Determine the number of bits if the time is O(n log n i).

Exercise 6.6.56 () Determine whether or not there is a choice of g that makes DoubleWait more efficient than SynchStages in both time and bits.

Exercise 6.6.57 Let L = (a_1, b_1), . . . , (a_k, b_k) be the k pairs of distinct labels a_i, b_i ∈ {1, . . . , n}. Consider now a complete network of n nodes; in this network, select 2k + 1 nodes x_0, x_1, . . . , x_2k. Show that it is always possible
1. to label the links between these nodes only with pairs from L (e.g., the link (x_0, x_1) will be labeled a_3 at x_0 and b_3 at x_1), and
2. to label all other links in the network with labels in {1, . . . , n} without violating local orientation anywhere.

Exercise 6.6.58 Consider exactly the same question as in Exercise 6.6.57, where, however, n is even and exactly one pair in L, say (a_1, b_1), is composed of identical labels, i.e., a_1 = b_1.

Exercise 6.6.59 Prove that in protocol MaxWave, the largest of the local clock values (when the execution starts) will reach (properly increased) every entity, and each entity will set its local clock to such a (properly increased) time value.

Exercise 6.6.60 Consider protocol MaxWave when the entities do not necessarily start at the same time, and let d be known. Let t be the (global) time the first entities start the execution of the protocol and let t(x) ≥ t be the global time when x starts. Modify the protocol so that (even though x does not know t) at time t + 2d it knows for sure that the system is operating in unison.

Exercise 6.6.61 Determine the message cost of protocol MaxWave (a) in a unidirectional ring, (b) in a bidirectional ring. You may assume that n is known.

Exercise 6.6.62 Determine the message cost of protocol MaxWave in a k-dimensional hypercube.

Exercise 6.6.63 Determine the worst-case and average-case message costs of protocol MaxWave in a tree network.


Exercise 6.6.64 Let, in protocol MaxWave, each entity reset its local clock to 0 when it starts the protocol. Prove that in this way, the maximum value transmitted is at most 2d.

Exercise 6.6.65 Consider the unison protocol MinWave where instead of setting the clocks to and propagating the largest value, we set the clock to and propagate the smallest value. Prove correctness, termination, and costs of protocol MinWave.

Exercise 6.6.66 Determine the bit and time costs of protocol MaxWave if the content of a message is communicated using the 2-bit communicator.

Exercise 6.6.67 Determine the bit and time costs of protocol MaxWave if the content of a message is communicated using a k-bit communicator.

Exercise 6.6.68 Show how to solve the firing squad problem on a tree using at most 4n − 4 messages, each containing a value of at most d, and in time at most 3d − 3.

Exercise 6.6.69 () Show how to solve the firing squad problem on a tree using only O(n) bits in O(d) time.

Exercise 6.6.70 In protocol MaxWave, let a message originated by an initiator reach another entity y at time t + w. Prove that the value of that message (incremented by 1) is exactly w.

Exercise 6.6.71 In protocol MaxWave, let a message originated by an initiator reach another entity y at time t + w. Prove that regardless of whether y has already independently started or starts now, the current value of its reset local clock will be smaller than w; thus, y will set its clock in unison with the clocks of the initiators.

6.6.2 Problems

Problem 6.6.1 (OneBit Protocol) Determine under what conditions information can be communicated using only 1 bit and describe the corresponding OneBit protocol.

Problem 6.6.2 (BitPattern Communicator) Consider the class of communicators that use a bit set to 1 to denote termination. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.3 (2-BitPattern Communicator) () Consider the class of communicators that use two successive transmissions of 1 to denote termination. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.4 (Size Communicator) Consider the class of communicators that use the first quantum to communicate the total number of bits that will be transmitted. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.5 (Pipeline in Trees: Max) Write the protocol for finding the maximum of all the values in a tree using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Problem 6.6.6 (Pipeline in Trees: Min) Write the protocol for finding the minimum of all the values in a tree using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Problem 6.6.7 (Maximum Finding I) () Consider a ring of known size n. Each entity has a positive integer value; they all start at the same time, but their values are not necessarily distinct. The maximum-finding problem is the one of having all the entities with the largest value become maximum and all the others small. Design a protocol to solve the maximum-finding problem in time linear in i_max using at most O(n log n) bits.

Problem 6.6.8 (Maximum Finding II) () Determine whether the maximum-finding problem in a ring of known size can be solved in time linear in i_max with O(n) bits.

Problem 6.6.9 (Bit-Optimal Election I) () Show how to elect a leader in a ring with only O(n) bits without knowing n. Possibly the time should be polynomial in i or exponential in n. (Hint: Use a single iteration of DoubleWait as a preprocessing phase.)

Problem 6.6.10 (Bit-Optimal Election II) () Determine whether or not it is possible to elect a leader without knowing n with Θ(n) bits in time sublinear in i, that is, to match the complexity achievable when n is known.

Problem 6.6.11 (Unison without knowing d) () Consider the unison problem when there is no known upper bound on the diameter d of the network. Prove or disprove that in this case the unison problem cannot be solved with explicit termination.

Problem 6.6.12 (Firing in a Line of CA with 6 States) () Finite cellular automata (CA) can only have a constant memory size, which means they cannot store a counter. The goal is thus to solve the firing squad problem with the least amount of time and to do so with the least amount of memory. The measure we use for the memory is the maximum number of different values that can be stored in the memory, and it is called the number of states of the automaton. Consider a line of CA with only one initiator (located at the end of the line). Develop a solution that uses only six states.

Problem 6.6.13 (Firing in a Line of CA with 5 States) () Consider a line of CA with only one initiator (located at the end of the line). Develop a solution using only five states or prove it cannot be done.


6.6.3 Answers to Exercises

Answer to Exercise 6.6.4
Consider the entity x that will become leader. It spontaneously initiated the protocol; its message traveled along the ring at the speed of $f(i_x) + 1 = 2^{i_x} + 1$, where $i_x$ is the input value of x; hence, its message returned after $(n-1)(2^{i_x}+1)$ time units; another n time units are required for the notification message.

Answer to Exercise 6.6.6
Let

$$b_0 = \begin{cases} 0 & \text{if } I \text{ is even} \\ 1 & \text{if } I \text{ is odd.} \end{cases}$$

If we were to encode $I$ in the sequence $b_1 \mid \lfloor I/2 \rfloor \mid b_0$, the receiver can reconstruct $I$ using as a decoding function $\mathrm{decode}(b_0 \mid q_1 \mid b_1) = 2q_1 + b_0$, where $b_0$ is used as an integer value. In this way, we have effectively cut the quantum of time in half: The waiting time becomes $2 + \lfloor I/2 \rfloor$. It can actually be further reduced. Let

$$b_1 = \begin{cases} 0 & \text{if } \lfloor I/2 \rfloor \text{ is even} \\ 1 & \text{if } \lfloor I/2 \rfloor \text{ is odd.} \end{cases}$$

If we were to encode $I$ in the sequence $b_1 \mid \lfloor I/2^2 \rfloor \mid b_0$, the receiver can reconstruct $I$ using as a decoding function $\mathrm{decode}(b_0 \mid q_1 \mid b_1) = 2(2q_1 + b_1) + b_0$, where both $b_0$ and $b_1$ are treated as integer values. The waiting time then becomes $2 + \lfloor I/4 \rfloor$.

Answer to Exercise 6.6.6
Consider the following communicator $R_3$: The first bit, $b_0$, is used to indicate whether $y = \lfloor \sqrt{I} \rfloor$ is odd; the second bit, $b_1$, is used to indicate whether $z = I - \lfloor \sqrt{I} \rfloor^2$ is odd; the third bit, $b_2$, is used to indicate whether $w = \lfloor z/2 \rfloor$ is odd. The two quanta waited are $q_1 = \lfloor y/2 \rfloor$ and $q_2 = \lfloor w/2 \rfloor$. To obtain $I$ the receiver simply computes $(2q_1 + b_0)^2 + (4q_2 + 2b_2 + b_1)$, where the bits are treated as integer values. For example, if $I = 7387$, we have $y = 85$, $z = 162$, and $w = 81$; thus, the two quanta are $q_1 = 42$ and $q_2 = 40$, while the bits are $b_0 = 1$, $b_1 = 0$, and $b_2 = 1$. The quantity $(2q_1 + b_0)^2 + (4q_2 + 2b_2 + b_1) = 85^2 + 162$ computed by the receiver is indeed $I$. Notice that $q_1 = \lfloor y/2 \rfloor \le \sqrt{I}/2$ and, as $z \le 2\sqrt{I}$, $q_2 = \lfloor w/2 \rfloor \le z/4 \le \sqrt{I}/2$; thus, this protocol uses 3 bits and time at most $\sqrt{I} + 3$.

The protocol is correct (Exercise 6.6.11). Exactly $k - 1$ quanta will be used, and $k$ bits will be transmitted. It is easy to verify that $I_{2i+1} \le \sqrt{I_i}$; since $I_{2i} = \lfloor \sqrt{I_i} \rfloor$ by definition, it follows that each quantum is at most $(I/4)^{\frac{1}{k-1}}$. Hence, the time complexity is at most

$$(k-1)\,(I/4)^{\frac{1}{k-1}} + k.$$

Partial Answer to Exercise 6.6.11
The encoding of $I$ can be defined recursively as follows: $E(I) = b_0 \mid E(I_1) \mid b_{k-1}$, where

$$E(I_i) = \begin{cases} E(I_{2i}) \mid b_i \mid E(I_{2i+1}) & \text{if } 1 \le i < k-1 \\ \text{quantum of length } I_i & \text{if } k-1 \le i \le 2k-3, \end{cases}$$

$$I_1 = \left\lfloor \frac{I}{2^2} \right\rfloor, \quad I_{2i} = \left\lfloor \sqrt{I_i} \right\rfloor, \quad I_{2i+1} = \left\lfloor \frac{I_i - I_{2i}^2}{2} \right\rfloor,$$

$$b_0 = I \bmod 2, \quad b_i = (I_i - I_{2i}^2) \bmod 2, \quad \text{and} \quad b_{k-1} = \left\lfloor \frac{I}{2} \right\rfloor \bmod 2.$$

To obtain $I$, the receiver will recursively compute $I_i = I_{2i}^2 + (2I_{2i+1} + b_i)$ until $I_1$ is determined; then, $I = 4I_1 + 2b_{k-1} + b_0$.

Answer to Exercise 6.6.14
We want to prove that $\omega(t, k) = \binom{t+q}{q}$. Let $w = \omega(t, k)$; by definition, it must be possible to communicate any element in $Z_w = \{0, 1, \ldots, w\}$ using $q = k - 1$ distinguished quanta requiring at most time $t$. In other words, $\omega(t, q+1)$ is equal to the number of distinct $q$-tuples $t_1, t_2, \ldots, t_q$ of positive integers such that $\sum_{1 \le i \le q} t_i \le t$. Given a positive integer $x$, let $T_q[x]$ denote the number of compositions of $x$ of size $q$, that is,

$$T_q[x] = \left|\left\{ x_1, x_2, \ldots, x_q : \sum_j x_j = x,\ x_j \in Z^+ \right\}\right|.$$

As $T_q[x] = \binom{x+q-1}{q-1}$, it follows that

$$\omega(t, q+1) = \sum_{0 \le i \le t} T_q[i] = \sum_{0 \le i \le t} \binom{i+q-1}{q-1} = \binom{t+q}{q},$$

which proves that $\omega(t, k) = \binom{t+q}{q}$.
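The counting identity in the answer to Exercise 6.6.14 can be checked by brute force on small cases. The sketch below is only an illustration: the function names are hypothetical, and it assumes that a quantum may have length zero, which is the reading under which the number of q-tuples of total length exactly x equals C(x+q−1, q−1).

```python
from itertools import product
from math import comb


def count_quanta_tuples(t: int, q: int) -> int:
    """Brute-force count of q-tuples of non-negative integers with sum <= t."""
    return sum(1 for quanta in product(range(t + 1), repeat=q) if sum(quanta) <= t)


def compositions(x: int, q: int) -> int:
    """T_q[x] under the assumption above: q-tuples of non-negative integers summing to x."""
    return comb(x + q - 1, q - 1)


def omega(t: int, q: int) -> int:
    """Closed form C(t + q, q) stated for omega(t, q + 1)."""
    return comb(t + q, q)


# Compare the brute-force count, the sum of T_q[i], and the closed form.
for t in range(7):
    for q in range(1, 4):
        assert count_quanta_tuples(t, q) == omega(t, q)
        assert sum(compositions(i, q) for i in range(t + 1)) == omega(t, q)
```

For instance, with q = 2 quanta and at most t = 5 time units, all three computations give 21.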


Answer to Exercise 6.6.15
Let $f(|U|, q) = t$. First of all we prove that for any solution protocol $P$ for $C_{q+1}(U)$, there exists a partition of $U$ into $t + 1$ disjoint subsets $U_0, U_1, \ldots, U_t$, such that

1. $|U_i| = \binom{i+q-1}{q-1}$, $0 \le i < t$, and $|U_t| \le \binom{t+q-1}{q-1}$;
2. the time $P(x)$ required by $P$ to communicate $x \in U_i$ is $P(x) \ge i$.

As $f(|U|, q) = t$, by Equation 6.9, $U$ is the largest set for which the two-party communication problem can always be solved using $b = q + 1$ transmissions and at most $t$ additional time units. Given a protocol $P$ for $C_{q+1}(U)$, order the elements $x \in U$ according to the time $P(x)$ required by $P$ to communicate them; let $\ddot{U}$ be the corresponding ordered set. Define $\ddot{U}_i$ to be the subset composed of the elements of $\ddot{U}$ whose ranking, with respect to the ordering defined above, is in the range from $\sum_{0 \le j < i} \binom{j+q-1}{q-1}$ to $\sum_{0 \le j \le i} \binom{j+q-1}{q-1}$. As $f(|U|, q) = t$, it follows that [...]. In other words, in protocol $P$, the number of elements that are uniquely identified using $q$ quanta for a total of $j$ time is greater than the number $T_q[j] = \binom{j+q-1}{q-1}$ of compositions of $j$ of size $q$: a clear contradiction. Hence, for every $x \in \ddot{U}_i$, $P(x) \ge i$, proving part 2. At this point, the rest of the proof easily follows.

Answer to Exercise 6.6.17
The number of distinct assignments of values to $q + 1$ distinguished bits is $2^{q+1}$. The number of distinct $q$-tuples $t_1, t_2, \ldots, t_q$ of positive integers such that $\sum_j t_j \le t$ is $\omega(t, k) = \binom{t+q}{q}$ (from 6.9). Therefore,

$$\beta(t, k) = 2^{q+1}\,\omega(t, k) = 2^{q+1} \binom{t+q}{q}.$$

Partial answer to Exercise 6.6.19
First prove the following: Let $\mu(|U|, q) = t$; for any solution protocol $P$ using $k$ reliable bits to communicate values from $U$, there exists a partition of $U$ into $t + 1$ disjoint subsets $U_0, U_1, \ldots, U_t$, such that

1. $|U_i| = 2^{q+1} \binom{i+q-1}{q-1}$, $0 \le i < t$, and $|U_t| = 2^{q+1} \binom{t+q-1}{q-1}$;
2. the time $P(x)$ required by $P$ to communicate $x \in U_i$ is $P(x) \ge i$.

Then the rest of the proof easily follows.

Answer to Exercise 6.6.44
Hint: Use binary search.

Answer to Exercise 6.6.49
Let x be fooled and incorrectly become leader at the end of the jth iteration. According to the algorithm, the only way that x has for becoming leader is the following:

1. At time t(x, j), x starts waiting for f(x, j). Note that during this time x must not receive any message if it is to become a leader later.
2. At time t(x, j) + f(x, j), x sends a “Wait1” message and becomes checking.
3. At time t(x, j) + f(x, j) + n_x(j), it receives a “Wait1” message and starts the second waiting. Note that during this time, x must not receive any message if it is to become a leader later.
4. At time t(x, j) + f(x, j) + n_x(j) + h(x, j), it sends a “Wait2” message and becomes checking-again.
5. At time t(x, j) + f(x, j) + g(x, j) + 2n_x(j), it receives a “Wait2” message and becomes leader.

Let y ≠ x and z ≠ x be the entities that originated the “Wait1” and “Wait2” messages, respectively, received by x. Notice that to originate these messages, y and z cannot be passive (they might become so later, though). The “Wait1” message is sent by y only after it has successfully finished waiting f(y, j) time units. That is, the “Wait1” message will be sent by y at time t(y, j) + f(y, j). This message requires d(y, x) time units to reach x. Therefore, t(x, j) + f(x, j) + m(x, j) = t(y, j) + f(y, j) + d(y, x). The “Wait2” message will arrive at x at time t(x, j) + f(x, j) + 2m(x, j) + g(x, j).

By contradiction, let z ≠ y. Consider first the case when y is located in the path from z to x. In this case, the “Wait2” message originated by z will reach y before x. If y is still waiting to receive a “Wait1” message, the reception of this “Wait2” message will alert it that something is wrong; it will not forward the “Wait2” message to x and will send a “Restart” instead, and thus, x will not become leader.
Therefore, z is located on the path from y to x. In this case, the “Wait1” message originated by y reaches z before arriving at x. As we have assumed that this message will arrive at x, it means that z must have forwarded it; the only way it could have done so is by becoming passive, but in this case z will not originate a “Wait2” message, contradicting the assumption we have made.

Answer to Exercise 6.6.50
Let x be the entity with the smallest id, and denote this value by i. Entity x will start at time t(x, j) and would stop waiting at time t(x, j) + f(x, j). As the entities start the iteration within n − 1 time units from each other, for every other entity y, t(x, j) − t(y, j) ≤ n − 1; as d(x, y) ≤ n − 1, this means that t(x, j) + f(x, j) + d(x, y) ≤ f(x, j) + 2(n − 1). Recall that f is a waiting function; this means that as x has the smallest identity and g(j) ≥ n, f(x, j) + 2(n − 1) < f(y, j) for every other entity y. Thus, t(x, j) + f(x, j) + d(x, y) < f(y, j). That is, x will finish waiting before anybody else; its message will travel along the ring transforming into passive all other entities and will reach x after n_x = n time units. Thus, x will be the only entity starting the second waiting, and its “Wait2” message will reach x again after n_x = n time units. Hence, x will validate its guess, become leader, and notify all other entities of termination.

Answer to Exercise 6.6.52
We know (Exercises 6.6.50 and 6.6.51) that if n ∉ ∂(j − 1), then no entity becomes a leader in the (j − 1)th iteration. According to the leader election algorithm, if an entity becomes neither leader nor passive during the (j − 1)th iteration, it becomes active and unconditionally sends an R message for the jth iteration. At this point the jth iteration starts with bounded delays. The proof of this Lemma is based on the proof that it is impossible for all the entities to become passive in the (j − 1)th iteration, in which case no leader would be elected and there would be no active entity that could send the R message.
First, let x be the entity with the smallest i_x, called i, and let all the entities become passive in the (j − 1)th iteration. Note that according to the algorithm the only way for an entity to become passive is to receive a C message while it is in the waiting state, that is, during f(x, j − 1) the entity x must receive a C message in order to become passive. Let y denote the entity that originates the C message. The C message will arrive at x at time exactly t(y, j − 1) + f(y, j − 1) + d(y, x). Thus, in order for x to become passive, it follows that

t(x, j − 1) + f(x, j − 1) > t(y, j − 1) + f(y, j − 1) + d(y, x)
t(x, j − 1) + i(b_{j−1} + 1) > t(y, j − 1) + i_y(b_{j−1} + 1) + d(y, x).


As i is the smallest value, i < i_y and, therefore, i(b_{j−1} + 1) < i_y(b_{j−1} + 1). Then, for (3) to hold, it must be that t(x, j − 1) > t(y, j − 1) + d(y, x), contradicting the fact that all the entities start the (j − 1)th iteration with bounded delay. Therefore, it is impossible that all the entities become passive in any iteration. In conclusion, if n ∉ ∂(j − 1), an R message is sent by an active entity and the next iteration starts with bounded delays, thus proving Lemma 3.

Answer to Exercise 6.6.53
Let x be the entity with the smallest value, and let i be that value. Entity x starts executing the protocol at most n − 1 time units after the other entities. It starts the (j + 1)th iteration less than f(x, j) + 2n_x(j) + h(x, j) time units after it started the jth iteration. As f(x, j) + g(x, j) + 2n_x(j) = 2g(j)i + 2g(j)i + g(j) − n_x(j) + 2n_x(j) = (4i + 1)g(j) + n_x(j), the total time required until x becomes leader is at most

$$n - 1 + \sum_{l=1}^{j} \big( (4i+1)\,g(l) + n_x(l) \big).$$

As there are also the n − 1 time units before the “Terminate” message notifies all entities, the total time for the algorithm is at most

$$2(n-1) + \sum_{l=1}^{j} \big( (4i+1)\,g(l) + n_x(l) \big).$$

Notice that if g(j) < n_x(j), then x would detect the anomaly and send a “Restart”; thus, we can assume that in the expression above the actual time spent is Min{g(j), n_x(j)}. Then the above expression becomes

$$2(n-1) + (4i+2) \sum_{l=1}^{j} g(l).$$

Answer to Exercise 6.6.54
The last iteration is $j = g^{-1}(n)$; as g is superincreasing, $g(j) \ge \sum_{i=1}^{j-1} g(i)$. The algorithm terminates in less than $2(n-1) + (4i_{min}+2) \sum_{l=1}^{j} g(l)$ time units. Now, $(4i_{min}+2) \sum_{l=1}^{j} g(l) \le 2(n-1) + (4i_{min}+2)\,2g(j)$.

Answer to Exercise 6.6.60
Sketch: Use a counter, initially set to 0; in each step, set it to the largest of the received counters increased by one and add it to any message sent in that step. When the counter is equal to 2d, stop.
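The step used in the answer to Exercise 6.6.54, namely that for a superincreasing g the sum g(1) + ... + g(j) is at most 2g(j), can be checked numerically. The sketch below is a hedged illustration only: the choice g(j) = 2^j and all function names are assumptions made for the example, not the function used by the protocol.

```python
def is_superincreasing(g, last: int) -> bool:
    """Check g(j) >= g(1) + ... + g(j-1) for every j in 1..last."""
    return all(g(j) >= sum(g(l) for l in range(1, j)) for j in range(1, last + 1))


def time_bound(n: int, i_min: int, g, last: int) -> int:
    """2(n-1) + (4*i_min + 2) * (g(1) + ... + g(last))."""
    return 2 * (n - 1) + (4 * i_min + 2) * sum(g(l) for l in range(1, last + 1))


def simplified_bound(n: int, i_min: int, g, last: int) -> int:
    """2(n-1) + (4*i_min + 2) * 2 * g(last), valid when g is superincreasing."""
    return 2 * (n - 1) + (4 * i_min + 2) * 2 * g(last)


def g(j: int) -> int:
    """Hypothetical superincreasing function, chosen only for this example."""
    return 2 ** j


# Example values (assumed): ring size n = 10, smallest id 3, last iteration 4.
assert is_superincreasing(g, 10)
assert time_bound(10, 3, g, 4) <= simplified_bound(10, 3, g, 4)
```

With these example values the exact sum gives 438 time units and the simplified bound gives 466.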


Answer to Exercise 6.6.68
Use saturation: Each of the two saturated nodes computes its eccentricity; the largest of the two is communicated to their subtrees, starting a “countdown.” When the furthermost entities receive the message, the values of all entities become 0 simultaneously, and they all enter state firing at the same time. This protocol uses at most 3n − 2 signals for the wake-up and saturation and an additional n − 2 messages for the countdown, each containing a value of at most d. The time is at most 2d for wake-up and saturation; at most d additional time units are needed for the countdown.


CHAPTER 7

Computing in Presence of Faults

7.1 INTRODUCTION

In all previous chapters, with few exceptions, we have assumed total reliability, that is, the system is failure free. Unfortunately, total reliability is practically nonexistent in real systems. In this chapter we will examine how to compute, if possible, when failures can and do occur.

7.1.1 Faults and Failures

We speak of a failure (or fault) whenever something happens in the system that deviates from the expected correct behavior. In distributed environments, failures and their causes can be very different in nature. In fact, a malfunction could be caused by a design error, a manufacturing error, a programming error, physical damage, deterioration in the course of time, harsh environmental conditions, unexpected inputs, operator error, cosmic radiation, and so forth. Not all faults lead (immediately) to computational errors (i.e., to incorrect results of the protocol), but some do. So the goal is to achieve fault-tolerant computations, that is, our aim is to design protocols that will proceed correctly in spite of the failures.

The unpredictability of the occurrence and nature of a fault and the possibility of multiple faults render the design of fault-tolerant distributed algorithms very difficult and complex, if at all possible. In particular, the more components (i.e., entities, links) are present in the system, the greater is the chance of one or more of them being/becoming faulty.

Depending on their cause, faults can be grouped into three general classes:

execution failures, that is, faults occurring during the execution of the protocol by an entity; examples of execution failures are computational errors occurring when performing an action, as well as the execution of an incorrect rule;

transmission failures, due to the incorrect functioning of the transmission subsystem; examples of transmission faults are the loss or corruption of a transmitted message as well as the delivery of a message to the wrong neighbor;


component failures, such as the deactivation of a communication link between two neighbors, the shutdown of a processor (and thus of the corresponding entity), and so forth.

Note that the same fault can occur because of different causes, and hence be classified differently. Consider, for example, a message that an entity x is supposed to send (according to the protocol) to a neighbor y but that never arrives. This fault could have been caused by x failing to execute the “send” operation in the protocol: an execution error; by the loss of the message by the transmission subsystem: a transmission error; or by the link (x, y) going down: a component failure.

Depending on their duration, faults are classified as transient or permanent. A transient fault occurs and then disappears of its own accord, usually within a short period of time. A bird flying through the beam of a microwave transmitter may cause lost bits on some network. A transient fault happens once in a while; it may or may not reoccur. If it continues to reoccur (not necessarily at regular intervals), the fault is said to be intermittent. A loose contact on a connector will often cause an intermittent fault. Intermittent faults are difficult to diagnose. A permanent failure is one that continues to exist until the fault is repaired. Burnt-out chips, software bugs, and disk head crashes often cause permanent faults.

Depending on their geographical “spread”, faults are classified as localized or ubiquitous. Localized faults always occur in the same region of the system, that is, only a fixed (although a priori unknown) set of entities/links will exhibit a faulty behavior. Ubiquitous faults will occur anywhere in the system, that is, all entities/links will exhibit at some point or another a faulty behavior. Note that usually transient failures are ubiquitous, while intermittent and permanent failures tend to be localized.

Clearly no protocol can be resilient to an arbitrary number of faults.
In particular, if the entire system collapses, no protocol can be correct. Hence, the goal is to design protocols that are able to withstand up to a certain number of faults of a given type.

Another fact to consider is that not all faults are equally dangerous. The danger of a fault lies not necessarily in the severity of the fault itself but rather in the consequences that its occurrence might have on the correct functioning of the system. In particular, danger for the system is intrinsically related to the notion of detectability. In general, if a fault is easily detected, a remedial action can be taken to limit or circumvent the damage; if a fault is hard or impossible to detect, the effects of the initial fault may spread throughout the network, possibly creating irreversible damage. For example, the permanent fault of a link going down forever is obviously more severe than a transient failure of that link. In contrast, the permanent failure of the link might be more easily detectable, and thus can be taken care of, than the occasional malfunctioning of the link. In this example, the less severe fault (the transient one) is potentially more dangerous for the system.

With this in mind, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated. To do so, we must first understand what the limits to the fault tolerance of a distributed computing environment are, expressed in terms of the nature and number of faults that make a nontrivial computation (im)possible.

7.1.2 Modeling Faults

Given the properties of the system and the types of faults assumed to occur, one would like to know the maximum number of faults that can be tolerated. This number is called the resiliency. To establish the resiliency, we need to be more precise on the types of faults that can occur. In particular, we need to develop a model to describe the failures in the system.

Faults, as mentioned before, can be due to execution errors, transmission errors, or component failures; the same fault could be caused by any of those three causes and hence could be in any of these three categories. There are several failure models, each differing on what is the factor “blamed” for a failure.

IMPORTANT. Each failure model offers a way of describing (some of the) faults that can occur in the system. A model is not reality, only an attempt to describe it.

Component Failure Models

The most common and best known models employed to discuss and study fault tolerance are the component failure models. In all the component failure models, the blame for any fault occurring in the system must be put on a component, that is, only components can fail, and if something goes wrong, it is because one of the involved components is faulty. Depending on which components are blamed, there are three types of component failure models: entity, link, and hybrid failure models.

In the entity failure (EF) model, only nodes can fail.
For example, if a node crashes, for whatever reason, that node will be declared faulty. In this model, a link going down will be modeled by declaring one of the two incident nodes to be faulty and to lose all the messages to and from its neighbor. Similarly, the corruption of a message during transmission must be blamed on one of the two incident nodes, which will be declared to be faulty.

In the link failure (LF) model, only links can fail. For example, the loss of a message over a link will lead to that link being declared faulty. In this model, the crash of a node is modeled by the crash of all its incident links. The event of an entity computing some incorrect information (because of an execution error) and sending it to a neighbor will be modeled by blaming the link connecting the entity to the neighbor; in particular, the link will be declared to be responsible for corrupting the content of the message.

FIGURE 7.1: Hierarchy of faults in the EF model (Crash; Send Omission; Receive Omission; Send/Receive Omission; Byzantine).

In the hybrid failure (HF) model, both links and nodes can be faulty. Although more realistic, this model is little known and seldom used.

NOTE. In all three component failure models, the status faulty is permanent and is not changed, even though the faulty behavior attributed to that component may never be repeated. In other words, once a component is marked as being faulty, that mark is never removed; so, for example, in the link failure model, if a message is lost on a link, that link will be considered faulty forever, even if no other message will ever be lost there.

Let us concentrate first on the entity failure model. That is, we focus on systems where (only) entities can fail. Within this environment, the nature of the failures of the entities can vary. With respect to the danger they may pose to the system, a hierarchy of failures can be identified.

1. With crash faults, a faulty entity works correctly according to the protocol, then suddenly just stops any activity (processing, sending, and receiving messages). These are also called fail-stop faults. Such a hard fault is actually the most benign from the overall system point of view.
2. With send/receive omission faults, a faulty entity occasionally loses some received messages or does not send some of the prepared messages. This type of fault may be caused by buffer overflows. Notice that crash faults are just a particular case of this type of failure: A crash is a send/receive omission in which all messages sent to and from that entity are lost. From the point of view of detectability, these faults are much more difficult than the previous one.
3. With Byzantine faults, a faulty entity is not bound by the protocol and can perform any action: It can omit to send or receive any message, send incorrect information to its neighbors, or behave maliciously so as to make the protocol fail. Undetected software bugs often exhibit Byzantine faults. Clearly, dealing with Byzantine faults is going to be much more difficult than dealing with the previous ones.

A similar hierarchy between faults exists in the link as well as in the hybrid failure models.

Communication Failures Model

A totally different model is the communication failure or dynamic fault (DF) model; in this model, the blame for any fault is put on the communication subsystem. More precisely, the communication system can lose messages, corrupt them, or deliver them to the incorrect neighbor. As in this model only the communication system can be faulty, a component fault, such as the crash failure of a node, is modeled by the communication system losing all the messages sent to and from that node. Notice that in this model, no mark (permanent or otherwise) is assigned to any component.

In the communication failure model, the communication subsystem can cause only three types of faults:

1. An omission: A message sent by an entity is never delivered.
2. An addition: A message is delivered to an entity, although none was sent.
3. A corruption: A message is sent but one with different content is received.

While the nature of omissions and corruptions is quite obvious, that of additions is less so. Indeed, it describes a variety of situations. The most obvious one is when sudden noise in the transmission channel is mistaken for transmission of information by the neighbor at the other end of the link. The more important occurrence of additions in systems is rather subtle, as an addition models the reception of a “nonauthorized message” (i.e., a message not transmitted by any authorized user). In this sense, additions model messages surreptitiously inserted in the system by some outside, and possibly malicious, entity. Spam being sent from an unsuspecting site clearly fits the description of an addition.
Summarizing, additions do occur and can be very dangerous.

These three types of faults are quite incomparable with each other in terms of danger. The hierarchy comes into place when two or all of these basic fault types can simultaneously occur in the system. The presence of all three types of faults creates what is called a Byzantine faulty behavior. The situation is depicted in Figure 7.2.

Clearly, no protocol can tolerate any number of faults of any type. If the entire system collapses, no computation is possible. Thus, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated.

Footnote 1. The term “Byzantine” refers to the Byzantine Empire (330–1453 AD), the long-lived eastern component of the Roman Empire whose capital city was Byzantium (now Istanbul), in which endless conspiracies, intrigue, and untruthfulness were alleged to be common among the ruling class.

FIGURE 7.2: Hierarchy of combinations of fault types in the DF model (Omission; Addition; Corruption; Omission + Addition; Omission + Corruption; Addition + Corruption; Byzantine).
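The three basic fault types of the communication failure model can be made concrete with a small simulation of a single link. This is a minimal sketch under stated assumptions: time is divided into steps, at most one message is sent per step, and the names Fault, deliver, and the placeholder strings for spurious and corrupted content are all hypothetical.

```python
from enum import Enum
from typing import Dict, List, Optional


class Fault(Enum):
    OMISSION = "omission"      # a sent message is never delivered
    ADDITION = "addition"      # a message is delivered although none was sent
    CORRUPTION = "corruption"  # a message arrives with different content


def deliver(sent: List[Optional[str]], faults: Dict[int, Fault]) -> List[Optional[str]]:
    """Return what the receiver observes at each step; None means silence."""
    received: List[Optional[str]] = []
    for step, msg in enumerate(sent):
        fault = faults.get(step)
        if fault is Fault.OMISSION and msg is not None:
            received.append(None)
        elif fault is Fault.ADDITION and msg is None:
            received.append("<spurious>")
        elif fault is Fault.CORRUPTION and msg is not None:
            received.append(f"<corrupted:{msg}>")
        else:
            received.append(msg)  # no applicable fault: faithful delivery
    return received


sent = ["a", None, "b", "c"]
faults = {0: Fault.OMISSION, 1: Fault.ADDITION, 2: Fault.CORRUPTION}
observed = deliver(sent, faults)
```

In this run the receiver observes silence at step 0, a spurious message at step 1, altered content at step 2, and the correct message "c" at step 3. Note that no component is marked as faulty: the faults are attached to individual transmissions, which is the distinguishing feature of the model.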

7.1.3 Topological Factors

Our goal is to design protocols that can withstand as many and as dangerous faults as possible and still exhibit a reasonable cost. What we will be able to do depends not only on our ability as designers but also on the inherent limits that the environment imposes. In particular, the impact of a fault, and thus our capacity to deal with it and design fault-tolerant protocols, depends not only on the type and number of faults but also on the communication topology of the system, that is, on the graph G. This is because all nontrivial computations are global, that is, they require the participation of possibly all entities. For this reason, Connectivity is a restriction required for all nontrivial computations. Even when initially existent, in the lifetime of the system, owing to faults, connectivity may cease to hold, rendering correctness impossible. Hence, the capacity of the topological structure of the network to remain connected in spite of faults is crucial.

There are two parameters that directly link topology to reliability and fault tolerance:

edge connectivity c_edge(G) is the minimum number of edges whose removal destroys the (strong) connectivity of G;

node connectivity c_node(G) is the minimum number of nodes whose removal destroys the (strong) connectivity of G.

NOTE. In the case of a complete graph, the node connectivity is always defined as n − 1.

Clearly, the higher the connectivity, the higher the resilience of the system to component failures. In particular,

Property 7.1.1 If c_edge(G) = k, then for any pair x and y of nodes there are k edge-disjoint paths connecting x to y.


Network G      Node Connectivity c_node(G)   Edge Connectivity c_edge(G)   Max Degree deg(G)
Tree T         1                             1                             ≤ n − 1
Ring R         2                             2                             2
Torus Tr       4                             4                             4
Hypercube H    log n                         log n                         log n
Complete K     n − 1                         n − 1                         n − 1

FIGURE 7.3: Connectivity of some networks.
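On small instances, the two connectivity parameters can be computed directly from their definitions by trying all removal sets in order of increasing size. The sketch below is a brute-force illustration only (exponential time, undirected graphs with at least two nodes); the function names and the example graphs are assumptions made for this example.

```python
from itertools import combinations


def is_connected(nodes, edges) -> bool:
    """True if the undirected graph induced on `nodes` is connected."""
    nodes = set(nodes)
    if len(nodes) <= 1:
        return True
    adj = {v: set() for v in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u] - seen:
            seen.add(v)
            stack.append(v)
    return seen == nodes


def edge_connectivity(nodes, edges) -> int:
    """Smallest number of edges whose removal disconnects the graph."""
    edges = list(edges)
    for k in range(len(edges) + 1):
        for removed in combinations(edges, k):
            kept = [e for e in edges if e not in removed]
            if not is_connected(nodes, kept):
                return k
    return len(edges)


def node_connectivity(nodes, edges) -> int:
    """Smallest number of nodes whose removal disconnects the graph;
    n - 1 by convention when no such set exists (complete graph)."""
    nodes = list(nodes)
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            if not is_connected(set(nodes) - set(removed), edges):
                return k
    return len(nodes) - 1


ring = [(i, (i + 1) % 6) for i in range(6)]           # ring on 6 nodes
path = [(i, i + 1) for i in range(4)]                 # a tree on 5 nodes
complete = list(combinations(range(5), 2))            # complete graph on 5 nodes
cube = [(u, u ^ (1 << b)) for u in range(8) for b in range(3) if u < (u ^ (1 << b))]
```

On these examples the functions return 2 and 2 for the ring, 1 and 1 for the tree, 4 and 4 for the complete graph on 5 nodes, and 3 and 3 (that is, log n) for the hypercube on 8 nodes.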

Property 7.1.2 If c_node(G) = k, then for any pair x and y of nodes there are k node-disjoint paths connecting x to y.

Let us consider some examples of connectivity. A tree T has the lowest connectivity of all undirected graphs: c_edge(T) = c_node(T) = 1, so any failure of a link or a node disconnects the network. A ring R fares little better as c_edge(R) = c_node(R) = 2. Higher connectivity can be found in denser graphs. For example, in a hypercube H, both connectivity parameters are log n. Clearly the highest connectivity is to be found in the complete network K. For a summary, see Figure 7.3.

Note that in all connected networks G the node connectivity is not greater than the edge connectivity (Exercise 7.10.1) and neither can be better than the maximum degree:

Property 7.1.3 ∀G, c_node(G) ≤ c_edge(G) ≤ deg(G)

As an example of the impact of edge connectivity on the existence of fault-tolerant solutions, consider the broadcast problem Bcast.

Lemma 7.1.1 If k arbitrary links can crash, it is impossible to broadcast unless the network is (k + 1)-edge-connected.

Proof. If G is only k-edge-connected, then there are k edges whose removal disconnects G. The failure of those links will make some nodes unreachable from the initiator of the broadcast and, thus, they will never receive the information. By contrast, if G is (k + 1)-edge-connected, then even after k links go down, by Property 7.1.1, there is still a path from the initiator to all other nodes. Hence flooding will work correctly. ∎

As an example of the impact of node connectivity on the existence of fault-tolerant solutions, consider the problem of an initiator that wants to broadcast some information, but some of the entities may be down. In this case, we just want the nonfaulty entities to receive the information. Then (Exercise 7.10.2),

Lemma 7.1.2 If k arbitrary nodes can crash, it is impossible to broadcast to the nonfaulty nodes unless the network is (k + 1)-node-connected.
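The argument of Lemma 7.1.1 can be illustrated on a ring, which is 2-edge-connected: flooding survives any single link crash but not every pair of crashes. The sketch below is a simplified model, assuming that the crashed links are down from the start and that flooding informs exactly the nodes reachable from the initiator over working links; the function name flood is hypothetical.

```python
from collections import deque


def flood(nodes, edges, initiator, crashed_links=()):
    """Return the set of nodes informed by flooding from `initiator`
    when the links in `crashed_links` never deliver any message."""
    crashed = {frozenset(e) for e in crashed_links}
    adj = {v: [] for v in nodes}
    for u, v in edges:
        if frozenset((u, v)) not in crashed:
            adj[u].append(v)
            adj[v].append(u)
    informed = {initiator}
    queue = deque([initiator])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in informed:  # first reception: accept and forward
                informed.add(v)
                queue.append(v)
    return informed


nodes = list(range(6))
ring = [(i, (i + 1) % 6) for i in nodes]

# One crashed link (k = 1): the ring is 2-edge-connected, so everybody is informed.
one_crash_ok = all(flood(nodes, ring, 0, [e]) == set(nodes) for e in ring)

# Two crashed links (k = 2) can cut the ring into two pieces.
partial = flood(nodes, ring, 0, [(0, 1), (3, 4)])
```

Here one_crash_ok is True, while with links (0, 1) and (3, 4) down only nodes 0, 4, and 5 receive the information.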

7.1.4 Fault Tolerance, Agreement, and Common Knowledge

In most distributed computations there is a need for the entities to make a local but coordinated decision. This coordinated decision is called an agreement. For example, in the election problem, every entity must decide whether it is the leader or not. The decision is local but must satisfy some global constraint (only one entity must become leader); in other words, the entities must agree on which one is the leader.

The set of constraints defining the agreement differs from problem to problem. For example, in minimum finding, the constraint is that all and only the entities with the smallest input value must become minimum. In ranking, where every entity has an initial data item, the constraint is that the value decided by each entity is precisely the rank of its data item in the overall distributed set.

When there are no faults, reaching these agreements is possible (as we have seen in the other chapters) and often straightforward. Unfortunately, the picture changes dramatically in the presence of faults. Interestingly, the impact that faults have on problems requiring agreement has common traits, in spite of the differences in the agreement constraints; that is, some of the impact is the same for all these problems. For this reason, we consider an abstract agreement problem where this common impact of faults on agreements is more evident.

In the p-Agreement Problem (Agree(p)), each entity x has an input value v(x) from some known set (usually {0, 1}) and must terminally decide upon a value d(x) from that set within a finite amount of time. Here, “terminally” means that once made, the decision cannot be modified. The problem is to ensure that at least p entities decide on the same value.
Additional constraints, called nontriviality (or sometimes validity) constraints, usually exist on the value to be chosen; in particular, if all values are initially the same, the decision must be on that value. This nontriviality constraint rules out default-type solutions (e.g., “always choose 0”).

Depending on the value of p, we have different types of agreement problems. Of particular interest is the case p = n/2 + 1, which is called strong majority. When p = n, we have the well-known Unanimity or Consensus Problem (Consensus), in which all entities must decide on the same value, that is,

    ∀x, y ∈ E, d(x) = d(y).    (7.1)
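The two constraints of Agree(p) can be checked mechanically on the outcome of a finished, fault-free run. The sketch below is only an illustration: the function name `satisfies_agree` and the dictionary representation are assumptions, and nontriviality is interpreted here as "if all inputs are equal, the value agreed upon by at least p entities must be that input".

```python
from collections import Counter

def satisfies_agree(inputs, decisions, p):
    """Check Agree(p) on a finished run.

    inputs, decisions: dicts mapping each entity to a value in {0, 1}.
    """
    counts = Counter(decisions.values())
    # Agreement: some value was decided by at least p entities.
    agreed = [v for v, c in counts.items() if c >= p]
    if not agreed:
        return False
    # Nontriviality (as interpreted above): identical inputs force the decision.
    if len(set(inputs.values())) == 1:
        common = next(iter(inputs.values()))
        return common in agreed
    return True

inputs = {"a": 0, "b": 1, "c": 1}
print(satisfies_agree(inputs, {"a": 1, "b": 1, "c": 1}, p=3))  # consensus reached
print(satisfies_agree(inputs, {"a": 0, "b": 1, "c": 1}, p=3))  # not unanimous
print(satisfies_agree(inputs, {"a": 0, "b": 1, "c": 1}, p=2))  # two entities agree
```

With p = n the check reduces to condition (7.1) plus nontriviality.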

The consensus problem occurs in many different applications. For example, consider an aircraft where several sensors are used to decide if the moment has come to drop a cargo; it is possible that some sensors detect “yes” while others detect “not yet.” On the basis of these values, a decision must be made on whether or not the cargo is to be dropped now. A solution strategy for our example is to drop the cargo only if all sensors agree; another is to decide for a drop as soon as at least one of the sensors indicates so. Observe that the first solution corresponds to computing the AND of the sensors’ values; in the consensus problem this solution corresponds to each entity x setting d(x) = AND({v(y) : y ∈ E}). The second solution consists of determining the
OR of those values, that is, d(x) = OR({v(y) : y ∈ E}). Notice that in both strategies, if the initial values are identical, each entity chooses that value. Another example is in distributed database systems, where each site (the entity) of the distributed database must decide whether to accept or drop a transaction; in this case, all sites will agree to accept the transaction only if no site rejects the transaction. The same solution strategies apply also in this case.

Summarizing, if there are no faults, consensus can be easily achieved (e.g., by computing the AND or the OR of the values). Lower forms of agreement, that is, when p < n, are even easier to resolve.

In the presence of faults, the situation changes drastically and even the problem must be restated. In fact, if an entity is faulty, it might be unable to participate in the computation; even worse, its faulty behavior might be an active impediment for the computation. In other words, as faulty entities cannot be required to behave correctly, the agreement constraint can hold only for the nonfaulty entities. So, for example, a consensus problem we are interested in is Entity-Fault-Tolerant Consensus (EFT-Consensus). Each nonfaulty entity x has an input value v(x) and must terminally decide upon a value d(x) within a finite amount of time. The constraints are

1. agreement: all nonfaulty entities decide on the same value;
2. nontriviality: if all values of the nonfaulty entities are initially the same, the decision must be on that value.

Similarly, we can define lower forms (i.e., when p < n) of agreement in the presence of entity failures (EFT-Agree(p)). For simplicity (and without any loss of generality), we can consider the Boolean case, that is, when the values are all in {0, 1}. Possible solutions to this problem are, for example, computing the AND or the OR of the input values of the nonfaulty entities, or the value of an elected leader.
In other words, consensus (fault tolerant or not) can be solved by solving any of a variety of other problems (e.g., function evaluation, leader election, etc.). For this reason, the consensus problem is elementary: If it cannot be solved, then none of those other problems can be solved either.

Reaching agreement, and consensus in particular, is strictly connected with the problem of reaching common knowledge. Recall (from Section 1.8.1) that common knowledge is the highest form of knowledge achievable in a distributed computing environment. Its connection to consensus is immediate. In fact, any solution protocol P to the (fault-tolerant) consensus problem has the following property: As it leads all (nonfaulty) entities to decide on the same value, say d, then within finite time the value d becomes common knowledge among all the nonfaulty entities. Conversely, any (fault-tolerant) protocol Q that creates common knowledge among all the nonfaulty entities can be used to make them decide on the same value and thus achieve consensus.

IMPORTANT. This implies that common knowledge is as elementary as consensus: If one cannot be achieved, neither can the other.
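The fault-free AND and OR strategies discussed above can be sketched in a few lines of Python. The sketch assumes that every entity has already learned all input values (for instance, by a broadcast from each entity); the function names are illustrative only. It shows that both strategies satisfy agreement and nontriviality when nothing fails.

```python
def consensus_and(values):
    """Every entity decides the AND of all input bits."""
    d = int(all(values.values()))
    return {x: d for x in values}

def consensus_or(values):
    """Every entity decides the OR of all input bits."""
    d = int(any(values.values()))
    return {x: d for x in values}

sensors = {"s1": 1, "s2": 0, "s3": 1}
print(consensus_and(sensors))  # drop only if all sensors say yes: everyone decides 0
print(consensus_or(sensors))   # drop as soon as one sensor says yes: everyone decides 1

# Nontriviality: identical inputs yield that same value under both strategies.
same = {"s1": 1, "s2": 1, "s3": 1}
print(consensus_and(same) == consensus_or(same))
```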


7.2 THE CRUSHING IMPACT OF FAILURES

In this section we will examine the impact that faults have in distributed computing environments. As we will see, the consequences are devastating even when faults are limited in quantity and danger. We will establish these results assuming that the entities have distinct values (i.e., under restriction ID); this makes the bad news even worse.

7.2.1 Node Failures: Single-Fault Disaster

In this section we examine node failures. We consider the possibility that entities may fail during the computation and we ask under what conditions the nonfaulty entities may still carry out the task. Clearly, if all entities fail, no computation is possible; also, we have seen that some faults are more dangerous than others. We are interested in computations that can be performed, provided that at most a certain number f of entities fail, and those failures are of a certain type τ (i.e., danger).

We will focus on achieving fault-tolerant consensus (problem EFT-Consensus described in Section 7.1.4), that is, we want all nonfailed entities to agree on the same value. As we have seen, this is an elementary problem.

A first and immediate limitation to the possibility of achieving consensus in the presence of node failures is given by the topology of the network itself. In fact, by Lemma 7.1.2, we know that if the graph is not (k + 1)-node-connected, a broadcast to nonfaulty entities is impossible if k entities can crash. This means that

Lemma 7.2.1 If k ≥ 1 arbitrary entities can possibly crash, fault-tolerant consensus cannot be achieved if the network is not (k + 1)-node-connected.

This means, for example, that in a tree, if a node goes down, consensus among the others cannot be achieved. Summarizing, we are interested in achieving consensus, provided that at most a given number f of entities fail, those failures are of at most a certain type τ of danger, and the node connectivity cnode of the network is high enough.
In other words, the problem is characterized by those three parameters, and we will denote it by EFT-Consensus(f, τ, cnode). We will start with the simplest case:

f = 1, that is, at most one entity fails;
τ = crash, that is, if an entity fails, it will do so in the most benign way;
cnode = n − 1, that is, the topology is not a problem as we are in the complete graph.

In other words, we are in a complete network (every entity is connected to every other entity); at most one entity will crash, leaving all the other entities connected to each other. What we want is that these other entities agree on the same value, that is, we want to solve problem EFT-Consensus(1, crash, n − 1). Unfortunately,

Theorem 7.2.1 (Single-Fault Disaster) EFT-Consensus(1, crash, n − 1) is unsolvable.

In other words, fault-tolerant consensus cannot be achieved even under the best of conditions. This really means that it is impossible to design fault-tolerant solutions for practically all important problems, as each could be used to achieve fault-tolerant consensus.

Before proceeding further with the consequences of this result, also called the FLP Theorem (after the initials of those who first proved it), let us see why it is true. What we are going to do is to show that no protocol can solve this problem, that is, no protocol always correctly terminates within finite time if an entity can crash. We will prove it by contradiction. We assume that a correct solution protocol P indeed exists and then show that there is an execution of this protocol in which the entities fail to achieve consensus in finite time (even if no one fails at all).

The proof is neither simple nor complex. It does require some precise terminology and uses some constructs that will be very useful in other situations also. We will need not only to describe the problem but also to define precisely the entire environment, including executions and events, among others. Some of this notation has already been introduced in Section 1.6.

Terminology

Let us start with the problem. Each entity x has an input register I_x, a write-once output register O_x, as well as unlimited internal storage. Initially, the input register of an entity holds a value in {0, 1}, and all the output registers are set to the same value b ∉ {0, 1}; once a value d_x ∈ {0, 1} is written in O_x, the content of that register is no longer modifiable. The goal is to have all nonfailed entities set, in finite time, their output registers to the same value d ∈ {0, 1}, subject to the nontriviality condition (i.e., if all input values are the same, then d must be that value).

Let us consider next the status of the system and the events being generated during an execution of the solution protocol P.
An entity reacts to external events by executing the actions prescribed by the protocol P . Some actions can generate events that will occur later. Namely, when an entity x sends a message, it creates the future event of the arrival of that message; similarly, when an entity sets the alarm clock, it creates the future event of that alarm ringing. (Although an entity can reset its clock as part of its processing, we can assume, without loss of generality, that each alarm will always be allowed to ring at the time it was originally set for.) In other words, as described in Chapter 1, at any time t during the execution of a protocol, there is a set Future(t) of the events that have been generated so far but have not happened yet. Recall that initially, Future(0) contains only the set of the spontaneous events. To simplify the discussion, we assume that all entities are initiators (i.e., the set Future(0) contains an impulse for each entity), and we will treat both spontaneous events and the ringing of the alarm clocks as the same type of events and call them timeouts. We represent by (x, M) the event of x receiving message M, and by (x, ∅) the event of a timeout occurring at x.

As we want to describe what happens to the computation if an entity fails by crashing, we add special system events called crashes, one per entity, to the initial set of events Future(0), and denote by (x, crash) the crash of entity x. As we are interested only in executions where there is at most one crash, if event (x, crash) occurs at time t, then all other crash events will be removed from Future(t). Furthermore, if x crashes, all the messages sent to x but not yet arrived will no longer be processed; similarly, any timeout set by x that has not yet occurred will no longer occur. In other words, if event (x, crash) occurs at time t, all events (arrivals and timeouts) involving x will be removed from all Future(t′) with t′ ≥ t.

Recall from Section 1.6 that the internal state of an entity is the value of all its registers and internal storage. Also recall that the configuration C(t) of the system at time t is a snapshot of the system at time t; it contains the internal state of each entity and the set Future(t) of the future events that have been generated so far. A configuration is nonfaulty if no crash event has occurred so far, faulty otherwise. Particular configurations are the initial configurations, in which all processes are in their initial state and Future is composed of all and only the spontaneous and crash events; by definition, all initial configurations are nonfaulty.

When an arrival or a timeout event ε occurs at x, x will act according to the protocol P: It will perform some local processing (thus changing its internal state); it might send some messages and set up its alarm clock; in other words, there will be a change in the configuration of the system (because event ε has been removed from Future, the internal state of x has changed, and some new events have possibly been added to Future). Clearly the configuration changes also if the event ε is a crash; notice that this event can occur only if no crash has occurred before.
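The bookkeeping just described can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the class name `Configuration`, the encoding of a timeout as the payload `None`, and the string `"crash"` as the crash marker (message payloads are assumed never to equal it). The sketch only models how a crash event changes Future; it does not execute a protocol.

```python
from dataclasses import dataclass, field

CRASH = "crash"   # marker of the system event (x, crash); hypothetical encoding
TIMEOUT = None    # stands for the empty payload of a timeout event (x, ∅)

@dataclass
class Configuration:
    states: dict                               # entity -> internal state
    future: set = field(default_factory=set)   # pending events (entity, payload)
    crashed: set = field(default_factory=set)  # entities that have crashed

    def is_faulty(self):
        return bool(self.crashed)

def initial_configuration(inputs):
    """All entities are initiators: one timeout and one crash event each."""
    future = {(x, TIMEOUT) for x in inputs} | {(x, CRASH) for x in inputs}
    return Configuration(states=dict(inputs), future=future)

def apply_crash(config, x):
    """Apply (x, crash): drop every event involving x and all other crash events."""
    assert (x, CRASH) in config.future, "crash event is not applicable"
    remaining = {(y, m) for (y, m) in config.future if y != x and m != CRASH}
    return Configuration(dict(config.states), remaining, config.crashed | {x})

c0 = initial_configuration({"a": 0, "b": 1, "c": 1})
c0.future |= {("a", "hello"), ("b", "hi")}     # two messages in transit
c1 = apply_crash(c0, "a")
print(c1.is_faulty(), sorted(c1.future, key=str))
```

After the crash of `a`, only the timeouts of `b` and `c` and the message in transit to `b` remain pending, and no further crash is applicable.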
Regardless of the nature of event ε, we will denote the new configuration as ε(C), where C was the configuration when the event occurred; we will say that ε is applicable to C and that the configuration ε(C) is reachable from C. We can extend this notation and say that a sequence of events ψ = ε1 ε2 . . . εk is applicable to configuration C if εk is applicable to C, εk−1 is applicable to εk(C), εk−2 is applicable to εk−1(εk(C)), . . ., and ε1 is applicable to ε2(. . . (εk(C)) . . .); we will say that the resulting configuration C′ = ε1(ε2(. . . (εk(C)) . . .)) = ψ(C) is reachable from C.

If an entity x sets the output register O_x to either 0 or 1, we say that x has decided on that value, and that state is called a decision state. The output register value cannot be changed after the entity has reached a decision state, that is, once x has made a decision, that decision cannot be altered. A configuration where all nonfailed entities have decided on the same value is called a decision configuration; depending on the value, we will distinguish between a 0-decision and a 1-decision configuration. Notice that once an entity makes a decision it cannot change it; hence, all configurations reachable from a 0-decision configuration are also 0-decision (similarly in the case of 1-decision).

Consider a configuration C and the set 𝒞(C) of all configurations reachable from C. If all decision configurations in this set are 0-decision (respectively, 1-decision), we say that C is 0-valent (respectively, 1-valent); in other words, in a v-valent configuration, whatever happens, the decision is going to be on v. If, instead, there are both 0-decision and 1-decision configurations in 𝒞(C), then we say that C is bivalent; in other words, in a bivalent configuration, which value is going to be chosen depends on the future events.

An important property of sequences of events is the following. Suppose that from some configuration C, the sequences of events ψ1 and ψ2 lead to configurations C1 and C2, respectively. If the entities affected by the events in ψ1 are all different from those affected by the events in ψ2, then ψ2 can be applied to C1 and ψ1 to C2, and both lead to the same configuration C3 (see Figure 7.4).

FIGURE 7.4: Commutativity of disjoint sequences of events.

More precisely,

Lemma 7.2.2
Let ψ1 and ψ2 be sequences of events applicable to C such that

1. the sets of entities affected by the events in ψ1 and ψ2, respectively, are disjoint; and
2. at most one of ψ1 and ψ2 includes a crash event.

Then, both ψ1ψ2 and ψ2ψ1 are applicable to C. Furthermore, ψ1(ψ2(C)) = ψ2(ψ1(C)).

If a configuration is reachable from some initial configuration, it will be called accessible; we are interested only in accessible configurations. Consider an accessible configuration C; a sequence of events applicable to C is deciding if it generates a decision configuration; it is admissible if all messages sent to nonfaulty entities are eventually received. Clearly, we are interested only in admissible sequences.

Proof of Impossibility

Let us now proceed with the proof of Theorem 7.2.1. By contradiction, assume that there is a protocol P that correctly solves the problem EFT-Consensus(1, crash, n − 1), that is, in every execution of P in a complete graph with at most one crash, within finite time all nonfailed entities decide on the same
value (subject to the nontriviality condition). In other words, if we consider all the possible executions of P, every admissible sequence of events is deciding.

The proof involves three steps. We first prove that among the initial configurations, there is at least one that is bivalent (i.e., where, depending on the future events, both a 0 and a 1 decision are possible). We then prove that starting from a bivalent configuration, it is always possible to reach another bivalent configuration. Finally, using these two results, we show how to construct an infinite admissible sequence that is not deciding, contradicting the fact that all admissible sequences of events in the execution of P are deciding.

Lemma 7.2.3

There is a bivalent initial conﬁguration.
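The proof of this lemma relies on a chain of initial configurations in which consecutive configurations differ in the input of a single entity. As a purely illustrative aid, the Python sketch below builds such a chain and locates the point where a univalent labeling must switch from 0 to 1. The function names are hypothetical, and the `toy_valency` labeling is an arbitrary stand-in chosen for the example, not the valency induced by any real protocol.

```python
def adjacency_chain(c_from, c_to):
    """Chain of input vectors from c_from to c_to; neighbors differ in one entry."""
    chain = [tuple(c_from)]
    current = list(c_from)
    for i, target in enumerate(c_to):
        if current[i] != target:
            current[i] = target       # flip one entity's input at a time
            chain.append(tuple(current))
    return chain

def first_valency_switch(chain, valency):
    """Return the first adjacent pair whose labels differ, or None."""
    for a, b in zip(chain, chain[1:]):
        if valency(a) != valency(b):
            return a, b
    return None

def toy_valency(c):
    """Arbitrary labeling used only for illustration."""
    return int(sum(c) >= 2)

chain = adjacency_chain((0, 0, 0), (1, 1, 1))
print(chain)
print(first_valency_switch(chain, toy_valency))
```

Because the chain starts at a configuration labeled 0 (all inputs 0) and ends at one labeled 1 (all inputs 1), some adjacent pair must carry different labels; this is the pair called C0 and C1 in the proof.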

Proof. By contradiction, let all initial configurations be univalent, that is, either 0- or 1-valent. Because of the nontriviality condition, we know that there is at least one 0-valent initial configuration (the one where all input values are 0) and one 1-valent initial configuration (the one where all input values are 1). Let us call two initial configurations adjacent if they differ only in the initial value of a single entity. For any two initial configurations C and C′, it is always possible to find a chain of initial configurations, each adjacent to the next, starting with C and ending with C′. Hence, in this sequence there exists a 0-valent initial configuration C⁰ adjacent to a 1-valent initial configuration C¹. Let x be the entity in whose initial value they differ. Now consider an admissible deciding sequence ψ for C⁰ in which the first event is (x, crash). Then, ψ can be applied also to C¹, and the corresponding configurations at each step of the sequence are identical except for the internal state of entity x. As the sequence is deciding, eventually the same decision configuration is reached. If it is 1-decision, then C⁰ is bivalent; otherwise, C¹ is bivalent. In either case, the assumed nonexistence of a bivalent initial configuration is contradicted. ∎

Lemma 7.2.4 Let C be a nonfaulty bivalent configuration, and let ε = (x, m) be a noncrash event that is applicable to C. Let 𝒜 be the set of nonfaulty configurations reachable from C without applying ε, and let ℬ = ε(𝒜) = {ε(A) | A ∈ 𝒜 and ε is applicable to A} (see Figure 7.5). Then, ℬ contains a nonfaulty bivalent configuration.

Proof. First of all, observe that as ε is applicable to C, by definition of 𝒜 and because of the unpredictability of communication delays, ε is applicable to every A ∈ 𝒜.

Let us now start the proof. By contradiction, assume that every configuration B ∈ ℬ is univalent. In this case, ℬ contains both 0-valent and 1-valent configurations (Exercise 7.10.4).

Call two configurations neighbors if one is reachable from the other after a single event, and x-adjacent if they differ only in the internal state of entity x. By an easy induction (Exercise 7.10.5), there exist two x-adjacent (for some entity x) neighbors A0, A1 ∈ 𝒜 such that D0 = ε(A0) is 0-valent and D1 = ε(A1) is 1-valent. Without loss of generality, let A1 = ε′(A0), where ε′ = (y, m′).

Case I. If x ≠ y, then D1 = ε′(D0) by Lemma 7.2.2. This is impossible, as any successor of a 0-valent configuration is also 0-valent (see Figure 7.6).

FIGURE 7.5: The situation of Lemma 7.2.4.

FIGURE 7.6: The situation in Case I of Lemma 7.2.4.

FIGURE 7.7: The situation in Case II of Lemma 7.2.4. The valency of the configuration, if known, is in square brackets.

Case II. If x = y, then consider the two configurations E0 = cx(D0) and E1 = cx(D1), where cx = (x, crash); as both ε and ε′ are noncrash events involving x, and the occurrence of cx removes from Future all the future events involving x, it follows that E0 and E1 are x-adjacent. Therefore, if we apply to both the same sequence of events not involving x, they will remain x-adjacent. As P is correct, there must be a finite sequence ψ of (noncrash) events not involving x that, starting from E0, reaches a decision configuration; as E0 is 0-valent, ψ(E0) is 0-decision (see Figure 7.7). As the events in ψ are noncrash and do not involve x, they are applicable also to E1, and ψ(E0) and ψ(E1) are x-adjacent. This means that all entities other than x have the same state in ψ(E0) and in ψ(E1); hence, ψ(E1) is also 0-decision. As E1 is 1-valent,
ψ(E1) is also 1-valent, a contradiction. So ℬ contains a bivalent configuration; as, by definition, ℬ is composed only of nonfaulty configurations, the lemma follows. ∎

Any deciding sequence ψ of events from a bivalent initial configuration goes to a univalent configuration, so there must be some single event in that sequence that generates a univalent configuration from a bivalent one; it is such an event that determines the eventual decision value. We now show that, using Lemmas 7.2.3 and 7.2.4 as tools, it is always possible to find a fault-free execution that avoids such events, creating a fault-free admissible but nondeciding sequence. We ensure that the sequence is admissible and nondeciding in the following way.

1. We maintain a queue Q of entities, initially in an arbitrary order.
2. We remove from the set of initial events all the crash events, that is, we consider only fault-free executions.
3. We maintain the future events sorted (in increasing order) according to the time they were originated.
4. We construct the sequence in stages as follows:
   (a) The execution begins in a bivalent initial configuration Cb whose existence is assured by Lemma 7.2.3.
   (b) Starting stage i from a bivalent configuration C, say at time t, consider the first entity x in the queue that has an event in Future(t). Let ε be the first event for x in Future(t).
   (c) By Lemma 7.2.4, there is a bivalent configuration C′ reachable from C by a sequence of events, say ψ, in which ε is the last event applied. The sequence for stage i is precisely this sequence of events ψ.
   (d) We execute the constructed sequence of events, ending in a bivalent configuration.
   (e) We move x and all preceding entities to the back of the queue and start the next stage.
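The queue discipline of steps (b) and (e) is what keeps the constructed sequence admissible. The Python sketch below isolates just that discipline; it does not build the bivalent configurations of step (c), and the names `pick_entity`, `rotate`, and `has_pending` are hypothetical.

```python
from collections import deque

def pick_entity(queue, has_pending):
    """Step (b): the first entity in the queue that has a pending event, or None."""
    for x in queue:
        if has_pending(x):
            return x
    return None

def rotate(queue, x):
    """Step (e): move x and all entities preceding it to the back, keeping their order."""
    idx = queue.index(x)
    queue.rotate(-(idx + 1))

queue = deque(["a", "b", "c", "d"])
has_pending = lambda x: x != "a"     # assume entity a never has a pending event
served = []
for _ in range(6):                   # six stages
    x = pick_entity(queue, has_pending)
    served.append(x)
    rotate(queue, x)
print(served)
```

In this run the entities with pending events are served in turn, so none of them waits forever; an entity that is skipped because it has nothing pending is moved to the back along with the one served.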

In any infinite sequence of such stages every entity comes to the front of the queue infinitely many times and receives every message sent to it. The sequence of events so constructed is therefore admissible. As each stage starts and ends in a bivalent configuration, a decision is never reached. The sequence of events so constructed is therefore nondeciding.

Summarizing, we have shown that there is an execution in which protocol P never reaches a decision, even if no entity crashes. It follows that P is not a correct solution to our consensus problem. ∎

7.2.2 Consequences of the Single-Fault Disaster

The Single-Fault Disaster result of Theorem 7.2.1 dashes any hope for the design of fault-tolerant distributed solution protocols for nontrivial problems and tasks. Because the consensus problem is an elementary one, the solution of almost every nontrivial distributed problem can be used to solve it; but as consensus cannot be solved even if just a single entity may crash, none of those other problems can be solved either if there is the possibility of failures.

The negative impact of this fact must not be underestimated; its main consequence is that it is impossible to design fault-tolerant communication software. This means that to have fault tolerance, the distributed computing environment must have additional properties. In other words, while in general not possible (because of Theorem 7.2.1), some degree of fault tolerance might be achieved in more restricted environments.

To understand which properties (and thus restrictions) would suffice, we need to examine the proof of Theorem 7.2.1 and to understand what particular conditions inside a general distributed computing environment make it work. Then, if we disable one of these conditions (by adding the appropriate restriction), we might be able to design a fault-tolerant solution.
The reason why Theorem 7.2.1 holds is that, as communication delays are finite but unpredictable, it is impossible to distinguish between a link experiencing very long communication delays and a failed link. In our case, the crash failure of an entity is equivalent to the simultaneous failure of all its links. So, if entity x is waiting for a reply from y and it has not received one so far, it cannot decide whether y has crashed or not. It is this “ambiguity” that leads, in the proof, to the construction of an admissible but nondeciding infinite sequence of events.

This means that to disable that proof we need to ensure that this fact (i.e., this “ambiguity”) cannot occur. Let us see how this can be achieved.

First of all, observe that if communication delays were bounded and clocks synchronized, then no ambiguity would occur: As any message would take at most Δ time, if entity x sends a message to y and does not receive the expected reply from y within 2Δ time, it can correctly decide that y has crashed. This means that, in synchronous systems, the proof of Theorem 7.2.1 does not hold; in other words, the restrictions Bounded Delays and Synchronized Clocks together disable that proof.

Footnote 2. Recall that communication delays include both transmission and processing delays.

Next observe that the reason why the ambiguity is removed in a synchronous environment is that the entities can use timeouts to reliably detect whether a crash failure has occurred. Indeed, the availability of any reliable fault detector would remove any ambiguity and thus disable that proof of Theorem 7.2.1. In other words, either restriction Link-Failure Detection or restriction Node-Failure Detection would disable that proof even if communication delays are unbounded.

Observing the proof, another point we can make is that it assumes that all initial bivalent configurations are nonfaulty, that is, the fault has not occurred yet. This is necessary in order to give the “adversary” the power to make an entity crash when most appropriate for the proof. (Simple exercise question: Where in the proof does the adversary exercise this power?) If the crash has occurred before the start of the execution, the adversary loses this power. It is actually sufficient that the faulty entity crashes before it sends any message, and the proof no longer holds. This means that it might still be possible to tolerate some crashes if they have already occurred, that is, they occur before the faulty entities send messages. In other words, the restriction Partial Reliability, stating that no faults will occur during the execution of the protocol, would disable the proof, even if communication delays are unbounded and there are no reliable fault detectors.

Notice that disabling the proof we used for Theorem 7.2.1 does not imply that the theorem does not hold; indeed, a different proof could still work. Fortunately, in the restricted environments we have just indicated, Theorem 7.2.1 itself is no longer valid, as we will see later.
Finally, observe that the unsolvability stated by Theorem 7.2.1 means that there is no deterministic solution protocol. It does not, however, rule out randomized solutions, that is, protocols that use randomization (e.g., ﬂip of a coin) inside the actions. The main drawback of randomized protocols is that they do not offer any certainty: Either termination is not guaranteed (except with high probability) or correctness is not guaranteed (except with high probability). Summarizing, the Single-Failure Disaster result imposes a dramatic limitation on the design of fault-tolerant protocols. The only way around (possibly) is by substantially restricting the environment: investing in the software and hardware necessary to make the system fully synchronous; constructing reliable fault detectors (unfortunately, none exists so far except in fully synchronous systems); or, in the case of crash faults only, ensuring somehow that all the faults occur before we start, that is, partial reliability. Alternatively, we can give up certainty on the outcome and use randomization.

7.3 LOCALIZED ENTITY FAILURES: USING SYNCHRONY

In a fully synchronous environment, the proof of the Single-Fault Disaster theorem does not hold. Indeed, as we will see, synchronicity allows a high degree of fault tolerance.
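The reason synchrony helps is the timeout argument of Section 7.2.2: with delays bounded by Δ and synchronized clocks, a missing reply after 2Δ is proof of a crash. A minimal Python sketch of that test follows; the function name `suspect_crashed` and its parameters are illustrative assumptions, and `delta` is assumed to bound both transmission and processing time.

```python
def suspect_crashed(sent_at, reply_at, now, delta):
    """Decide whether y can be declared crashed.

    sent_at:  time at which x sent its message to y
    reply_at: time at which the reply arrived, or None if none has arrived
    now:      current time on the (synchronized) clock
    delta:    assumed upper bound on any communication delay
    """
    deadline = sent_at + 2 * delta     # request takes <= delta, reply <= delta
    if reply_at is not None and reply_at <= deadline:
        return False                   # the reply arrived in time: y is alive
    return now > deadline              # no reply and the deadline has passed

print(suspect_crashed(sent_at=0, reply_at=None, now=5, delta=3))  # too early to tell
print(suspect_crashed(sent_at=0, reply_at=None, now=7, delta=3))  # y has crashed
print(suspect_crashed(sent_at=0, reply_at=4, now=7, delta=3))     # y replied in time
```

Without a known bound `delta`, no finite deadline would justify the second answer, which is exactly the ambiguity exploited in the impossibility proof.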

Recall from Chapter 6 that a fully synchronous system is defined by two restrictions: Bounded Delays and Synchronized Clocks. We can actually replace the first restriction with the Unitary Delays one, without any loss of generality. These restrictions together are denoted by Synch.

We consider again the fault-tolerant consensus problem EFT-Consensus (introduced in Section 7.1.4) in the complete graph in case of component failures, and more specifically we concentrate on entity failures, that is, the faults are localized (i.e., restricted) to a set of entities (even though we do not know beforehand which they are). The problem asks for all the nonfaulty entities, each starting with an initial value v(x), to terminally decide on the same value in finite time, subject to the nontriviality condition: If all initial values are the same, the decision must be on that value.

We will see that if the environment is fully synchronous, under some additional restrictions, the problem can be solved even when almost one third of the entities are Byzantine. In the case of crash failures, we can actually solve the problem tolerating any number of failures.

7.3.1 Synchronous Consensus with Crash Failures

In a synchronous system in which the faults are just crashes of entities, under some restrictions, consensus (among the nonfailed entities) can be reached regardless of the number f of entities that may crash. The restrictions considered here are

Additional Assumptions
1. Connectivity, Bidirectional Links;
2. Synch;
3. the network is a complete graph;
4. all entities start simultaneously;
5. the only type of failure is entity crash.

Note that an entity can crash while performing an action, that is, it may crash after sending some but not all the messages requested by the action.

Solution Protocols

In this environment there are several protocols that achieve consensus tolerating up to f ≤ n − 1 crashes. Almost all of them adopt the same simple mechanism, Tell All(T), where T is an input parameter. The basic idea behind the mechanism is to collect at each nonfaulty entity enough information so that all nonfaulty entities are able to make the same decision by a given time.

Mechanism Tell All(T)

At each time step t ≤ T, every nonfailed entity x sends to all its neighbors a message containing a “report” on everything it knows and waits for a similar message from each of them.


TellAll-Crash.
begin
   for t = 0, . . . , f do
      compute rep(x, t);
      send rep(x, t) to N(x);
   endfor
   O_x := rep(x, f + 1);
end

FIGURE 7.8: Protocol TellAll-Crash.

If x has not received a message from neighbor y by time t + 1, it knows that y has crashed; if it receives a message from y, it learns a “report” on what y knew at time t (note that in case of Byzantine faults, this “report” could be false). For the appropriate choice of T and with the appropriate information sent in the “report,” this mechanism enables the nonfaulty entities to reach consensus. The actual value of T and the nature of the report depend on the types and number of faults the protocol is supposed to tolerate.

Let us now see a fairly simple consensus protocol, called TellAll-Crash and based on this mechanism, that tolerates up to f ≤ n − 1 crashes. The algorithm is just mechanism Tell All where T = f and the “report” consists of the AND function of all the values seen so far. More precisely,

\[
rep(x,t) =
\begin{cases}
v(x) & \text{if } t = 0, \\
\mathrm{AND}\bigl(rep(x,t-1),\, M(x_1,t),\, \ldots,\, M(x_{n-1},t)\bigr) & \text{otherwise}
\end{cases}
\tag{7.2}
\]

where x_1, . . . , x_{n−1} are the neighbors of x and M(x_i, t) denotes the message received by x from x_i at time t, if any; otherwise M(x_i, t) = 1. The protocol is shown in Figure 7.8.

To see how and why protocol TellAll-Crash works, let us make some observations. Let F be the set of entities that crashed before or during the execution of the protocol, and S the others. Clearly, |F| ≤ f and |F| + |S| = n.
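To experiment with the protocol just defined, the following is a minimal round-by-round simulation sketch. It assumes a complete graph with unit delays, Boolean initial values, and a hypothetical `crashes` dictionary that plays the role of the adversary: `crashes[y] = (t, recipients)` means that y crashes at time t after sending its report only to the listed recipients. The function name and this encoding are illustrative choices, not part of the original protocol description.

```python
def tell_all_crash(values, f, crashes=None):
    """Simulate TellAll-Crash; return {entity: decision} for the entities in S.

    values  -- list of initial bits v(x), indexed by entity 0..n-1
    f       -- number of crashes the protocol is asked to tolerate
    crashes -- hypothetical adversary: {entity: (time, recipients)}
    """
    crashes = crashes or {}
    if len(crashes) > f:
        raise ValueError("more crashes than the protocol is asked to tolerate")
    n = len(values)
    rep = list(values)                  # rep(x, 0) = v(x)
    alive = set(range(n))
    for t in range(f + 1):              # sending steps t = 0, ..., f
        inbox = {x: [] for x in range(n)}
        for y in alive:
            if y in crashes and crashes[y][0] == t:
                targets = set(crashes[y][1])      # partial send, then crash
            else:
                targets = set(range(n)) - {y}     # N(y) in a complete graph
            for x in targets:
                inbox[x].append(rep[y])           # arrives at time t + 1
        alive -= {y for y, (when, _) in crashes.items() if when == t}
        # rep(x, t+1) = AND of rep(x, t) and the messages received at t + 1;
        # a missing message counts as 1, so it is simply left out of the AND.
        for x in alive:
            rep[x] = min([rep[x]] + inbox[x])
    return {x: rep[x] for x in sorted(alive)}     # O_x = rep(x, f + 1)


# The only 0 is relayed along a chain of entities that crash one per step.
chain = {0: (0, [1]), 1: (1, [2])}
print(tell_all_crash([0, 1, 1, 1], f=2, crashes=chain))   # {2: 0, 3: 0}
```

In this run the 0 held by entity 0 reaches entity 2 only at time 2, yet the last sending step at time f = 2 still lets entity 2 pass it on to entity 3 before both decide. Since the values are bits, `min` is used as a compact stand-in for AND.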

Property 7.3.1 If all entities start with initial value 1, all entities in S will decide on 1.

Property 7.3.2 If an entity x ∈ S has or receives a 0 at time t ≤ f, then all entities in S will receive a 0 at time t + 1.

Property 7.3.3 If an entity x ∈ S has or receives a 0 during the execution of the protocol, it will decide on 0.
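One way to see why Property 7.3.3 holds, sketched here directly from definition (7.2) rather than quoted from the text, is that AND can never turn a 0 back into a 1. For an entity x ∈ S, which by definition keeps computing until the end:

```latex
\[
rep(x,t) = 0
\;\Longrightarrow\;
rep(x,t+1) = \mathrm{AND}\bigl(0,\, M(x_1,t+1),\, \ldots,\, M(x_{n-1},t+1)\bigr) = 0 .
\]
% Receiving a 0 at time t means M(x_i, t) = 0 for some i, which by (7.2)
% also gives rep(x, t) = 0. By induction on t,
\[
rep(x,t) = 0 \ \text{for some } t \le f+1
\;\Longrightarrow\;
O_x = rep(x, f+1) = 0 .
\]
```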


These three facts imply that all nonfailed entities will decide on 0 if at least one of them has initial value 0 and will decide on 1 if all entities have initially 1. The only case left to consider is when all entities in S have initially 1 but some entities in F have initially 0. If any of the latter does not crash in the first step, by time t = 1 all entities in S will receive 0 and thus decide on 0 at time f + 1. This means that the nonfailed entities at time t = f + 1 will all decide on 0 unless

1. up to time f they have seen and received only 1; and
2. at time f + 1 some (but not all) of them receive 0.

In fact, in such a case, as the execution terminates at time f + 1, there is no time for the nonfailed entities that have seen 0 to tell the others. Can this situation occur in reality?

For this situation to occur, the 0 must have been sent at time f by some entity y_f; note that this entity must be in F and crash in this step, sending the 0 only to some of its neighbors (otherwise all entities in S and not just some would have received 0 at time f + 1). Also, y_f must have initially had 1 and received 0 only at time f (otherwise it would have sent it before and as it had not crashed yet, everybody would have received it). Let y_{f−1} be one of the entities that sent the 0 received by y_f at time f; note that this entity must be in F and crashed in that step, sending the 0 only to y_f and other entities not in S (otherwise all entities in S would receive 0 by time f + 1). Also, y_{f−1} must have initially had 1 and received


. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.2 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9.6.3 Answers to Exercises. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

541 541 542 542 544 545 548 549 549 550 551 554 559 561 564 566 566 567 568 569 570 570 572 573

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

577

PREFACE

The computational universe surrounding us is clearly quite different from that envisioned by the designers of the large mainframes of half a century ago. Even the subsequent most futuristic visions of supercomputing and of parallel machines, which have guided the research drive and absorbed the research funding for so many years, are far from today’s computational realities. These realities are characterized by the presence of communities of networked entities communicating with each other, cooperating toward common tasks or the solution of a shared problem, and acting autonomously and spontaneously. They are distributed computing environments. It has been from the ﬁelds of network and of communication engineering that the seeds of what we now experience have germinated. The growth in understanding has occurred when computer scientists (initially very few) started to become aware of and study the computational issues connected with these new network-centric realities. The internet, the web, and the grids are just examples of these environments. Whether over wired or wireless media, whether by static or nomadic code, computing in such environments is inherently decentralized and distributed. To compute in distributed environments one must understand the basic principles, the fundamental properties, the available tools, and the inherent limitations. This book focuses on the algorithmics of distributed computing; that is, on how to solve problems and perform tasks efﬁciently in a distributed computing environment. Because of the multiplicity and variety of distributed systems and networked environments and their widespread differences, this book does not focus on any single one of them. 
Rather, it describes and employs a distributed computing universe that captures the nature and basic structure of those systems (e.g., distributed operating systems, data communication networks, distributed databases, transaction processing systems), allowing us to discard or ignore the system-specific details while identifying the general principles and techniques. This universe consists of a finite collection of computational entities communicating by means of messages in order to achieve a common goal; for example, to perform a given task, to compute the solution to a problem, to satisfy a request either from the user (i.e., outside the environment) or from other entities. Although each entity is capable of performing computations, it is the collection

1 Incredibly, the terms “distributed systems” and “distributed computing” have been for years hijacked and (ab)used to describe very limited systems and low-level solutions (e.g., client-server) that have little to do with distributed computing.


of all these entities that together will solve the problem or ensure that the task is performed. In this universe, to solve a problem, we must discover and design a distributed algorithm or protocol for those entities: a set of rules that specify what each entity has to do. The collective but autonomous execution of those rules, possibly without any supervision or synchronization, must enable the entities to perform the desired task or to solve the problem. In the design process, we must ensure both correctness (i.e., the protocol we design indeed solves the problem) and efficiency (i.e., the protocol we design has a “small” cost).

As the title says, this book is on the Design and Analysis of Distributed Algorithms. Its goal is to enable the reader to learn how to design protocols to solve problems in a distributed computing environment, not by listing the results but rather by teaching how they can be obtained. In addition to the “how” and “why” (necessary for problem solution, from basic building blocks to complex protocol design), it focuses on providing the analytical tools and skills necessary for complexity evaluation of designs.

There are several levels of use of the book. The book is primarily a senior-undergraduate and graduate textbook; it contains the material for two one-term courses or alternatively a full-year course on Distributed Algorithms and Protocols, Distributed Computing, Network Computing, or Special Topics in Algorithms. It covers the “distributed part” of a graduate course on Parallel and Distributed Computing (the chapters on Distributed Data, Routing, and Synchronous Computing, in particular), and it is the theoretical companion book for a course in Distributed Systems, Advanced Operating Systems, or Distributed Data Processing.
The book is written for the students from the students’ point of view, and it follows closely a well-defined teaching path and method (the “course”) developed over the years; both the path and the method become apparent while reading and using the book. It also provides a self-contained, self-directed guide for system-protocol designers and for communication software engineers and developers, as well as for researchers wanting to enter or just interested in the area; it enables hands-on, head-on, and in-depth acquisition of the material. In addition, it is a serious sourcebook and reference book for investigators in distributed computing and related areas.

Unlike the other available textbooks on these subjects, the book is based on a very simple fully reactive computational model. From a learning point of view, this makes the explanations clearer and readers’ comprehension easier. From a teaching point of view, this approach provides the instructor with a natural way to present otherwise difficult material and to guide the students through, step by step. The instructors themselves, if not already familiar with the material or with the approach, can achieve proficiency quickly and easily. All protocols in the textbook as well as those designed by the students as part of the exercises are immediately programmable. Hence, the subtleties of actual implementation can be employed to enhance the understanding of the theoretical

2 An open source Java-based engine, DisJ, provides the execution and visualization environment for our reactive protocols.


design principles; furthermore, experimental analysis (e.g., performance evaluation and comparison) can be easily and usefully integrated in the coursework, expanding the analytical tools.

The book is written so as to require no prerequisites other than standard undergraduate knowledge of operating systems and of algorithms. Clearly, concurrent or prior knowledge of communication networks, distributed operating systems, or distributed transaction systems would help the reader to ground the material of this course in some practical application context; however, none is necessary.

The book is structured into nine chapters of different lengths. Some are focused on a single problem, others on a class of problems. The structuring of the written material into chapters could have easily followed different lines. For example, the material of election and of mutual exclusion could have been grouped together in a chapter on Distributed Control. Indeed, these two topics can be taught one after the other: although missing an introduction, this “hidden” chapter is present in a distributed way. An important “hidden” chapter is Chapter 10 on Distributed Graph Algorithms, whose content is distributed throughout the book: Spanning-Tree Construction (Section 2.5), Depth-First Traversal (Section 2.3.1), Breadth-First Spanning Tree (Section 4.2.5), Minimum-Cost Spanning Tree (Section 3.8.1), Shortest Paths (Section 4.2.3), Centers and Medians (Section 2.6), Cycle and Knot Detection (Section 8.2).

The suggested prerequisite structure of the chapters is shown in Figure 1. As suggested by the figure, the first three chapters should be covered sequentially and before the other material. There are only two other prerequisite relationships. The relationship between Synchronous Computation (Chapter 6) and Computing in Presence of Faults (Chapter 7) is particular.
The recommended sequencing is in fact the following: Sections 7.1–7.2 (providing the strong motivation for synchronous computing), Chapter 6 (describing fault-free synchronous computing), and the rest of Chapter 7 (dealing with fault-tolerant synchronous computing as well as other issues). The other suggested

Figure 1: Prerequisite structure of the chapters.


prerequisite structure is that the topic of Stable Properties (Chapter 8) be handled before that of Continuous Computations (Chapter 9). Other than that, the sections can be mixed and matched depending on the instructor’s preferences and interests. An interesting and popular sequence for a one-semester course is given by Chapters 1–6. A more conventional one-semester sequence is provided by Chapters 1–3 and 6–9. The symbol () after a section indicates noncore material. In connection with Exercises and Problems the symbol () denotes difficulty (the more the symbols, the greater the difficulty).

Several important topics are not included in this edition of the book. In particular, this edition does not include algorithms on distributed coloring, on minimal independent sets, on self-stabilization, or on Sense of Direction. By design, this book does not include distributed computing in the shared memory model, focusing entirely on the message-passing paradigm.

This book has evolved from the teaching method and the material I have designed for the fourth-year undergraduate course Introduction to Distributed Computing and for the graduate course Principles of Distributed Computing at Carleton University over the last 20 years, and for the advanced graduate courses on Distributed Algorithms I have taught as part of the Advanced Summer School on Distributed Computing at the University of Siena over the last 10 years. I am most grateful to all the students of these courses: through their feedback they have helped me verify what works and what does not, shaping my teaching and thus the current structure of this book. Their keen interest and enthusiasm over the years have been the main reason for the existence of this book.

This book is very much a work in progress. I would welcome any feedback that will make it grow and mature and change.
Comments, criticisms, and reports on personal experience as a lecturer using the book, as a student studying it, or as a researcher glancing through it, suggestions for changes, and so forth: I am looking forward to receiving any. Clearly, reports on typos, errors, and mistakes are very much appreciated. I tried to be accurate in giving credits; if you know of any omission or mistake in this regard, please let me know.

My own experience as well as that of my students leads to the inescapable conclusion that distributed algorithms are fun both to teach and to learn. I welcome you to share this experience, and I hope you will reach the same conclusion.

Nicola Santoro

CHAPTER 1

Distributed Computing Environments

The universe in which we will be operating will be called a distributed computing environment. It consists of a ﬁnite collection E of computational entities communicating by means of messages. Entities communicate with other entities to achieve a common goal; for example, to perform a given task, to compute the solution to a problem, to satisfy a request either from the user (i.e., outside the environment) or from other entities. In this chapter, we will examine this universe in some detail.

1.1 ENTITIES

The computational unit of a distributed computing environment is called an entity. Depending on the system being modeled by the environment, an entity could correspond to a process, a processor, a switch, an agent, and so forth in the system.

Capabilities

Each entity x ∈ E is endowed with local (i.e., private and nonshared) memory $M_x$. The capabilities of x include access (storage and retrieval) to local memory, local processing, and communication (preparation, transmission, and reception of messages). Local memory includes a set of defined registers whose values are always initially defined; among them are the status register (denoted by status(x)) and the input value register (denoted by value(x)). The register status(x) takes values from a finite set of system states S; examples of such values are “Idle,” “Processing,” “Waiting,” and so forth. In addition, each entity x ∈ E has available a local alarm clock $c_x$, which it can set and reset (turn off). An entity can perform only four types of operations:

- local storage and processing
- transmission of messages
- (re)setting of the alarm clock
- changing the value of the status register

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.


Note that, although setting the alarm clock and updating the status register can be considered as a part of local processing, because of the special role these operations play, we will consider them as distinct types of operations.

External Events

The behavior of an entity x ∈ E is reactive: x only responds to external stimuli, which we call external events (or just events); in the absence of stimuli, x is inert and does nothing. There are three possible external events:

- arrival of a message
- ringing of the alarm clock
- spontaneous impulse

The arrival of a message and the ringing of the alarm clock are events that are external to the entity but originate within the system: the message is sent by another entity, and the alarm clock is set by the entity itself. Unlike the other two types of events, a spontaneous impulse is triggered by forces external to the system and thus outside the universe perceived by the entity. As an example of an event generated by forces external to the system, consider an automated banking system: its entities are the bank servers where the data is stored and the automated teller machines (ATMs); the request by a customer for a cash withdrawal (i.e., update of data stored in the system) is a spontaneous impulse for the ATM (the entity) where the request is made. For another example, consider a communication subsystem in the open systems interconnection (OSI) Reference Model: the request from the network layer for a service by the data link layer (the system) is a spontaneous impulse for the data-link-layer entity where the request is made. Appearing to entities as “acts of God,” the spontaneous impulses are the events that start the computation and the communication.

Actions

When an external event e occurs, an entity x ∈ E will react to e by performing a finite, indivisible, and terminating sequence of operations called an action.
An action is indivisible (or atomic) in the sense that its operations are executed without interruption; in other words, once an action starts, it will not stop until it is finished. An action is terminating in the sense that, once it is started, its execution ends within finite time. (Programs that do not terminate cannot be termed actions.) A special action that an entity may take is the null action nil, where the entity does not react to the event.

Behavior

The nature of the action performed by the entity depends on the nature of the event e, as well as on which status the entity is in (i.e., the value of status(x)) when the event occurs. Thus the specification will take the form

Status × Event −→ Action,


which will be called a rule (or a method, or a production). In a rule s × e −→ A, we say that the rule is enabled by (s, e). The behavioral specification, or simply behavior, of an entity x is the set B(x) of all the rules that x obeys. This set must be complete and nonambiguous: for every possible event e and status value s, there is one and only one rule in B(x) enabled by (s, e). In other words, x must always know exactly what it must do when an event occurs. The set of rules B(x) is also called the protocol or distributed algorithm of x.

The behavioral specification of the entire distributed computing environment is just the collection of the individual behaviors of the entities. More precisely, the collective behavior B(E) of a collection E of entities is the set B(E) = {B(x) : x ∈ E}. Thus, in an environment with collective behavior B(E), each entity x will be acting (behaving) according to its distributed algorithm and protocol (set of rules) B(x).

Homogeneous Behavior

A collective behavior is homogeneous if all entities in the system have the same behavior, that is, ∀x, y ∈ E, B(x) = B(y). This means that to specify a homogeneous collective behavior, it is sufficient to specify the behavior of a single entity; in this case, we will indicate the behavior simply by B. An interesting and important fact is the following:

Property 1.1.1 Every collective behavior can be made homogeneous.

This means that if we are in a system where different entities have different behaviors, we can write a new set of rules, the same for all of them, which will still make them behave as before.

Example

Consider a system composed of a network of several identical workstations and a single server; clearly, the set of rules that the server and a workstation obey is not the same, as their functionality differs. Still, a single program can be written that will run on both entities without modifying their functionality.
We need to add to each entity an input register, my role, which is initialized to either “workstation” or “server,” depending on the entity; for each status–event pair (s, e) we create a new rule with the following action:

s × e −→ { if my role = workstation then $A_{workstation}$ else $A_{server}$ endif },

where $A_{workstation}$ (respectively, $A_{server}$) is the original action associated to (s, e) in the set of rules of the workstation (respectively, server). If (s, e) did not enable any rule for a workstation (e.g., s was a status defined only for the server), then $A_{workstation}$ = nil in the new rule; analogously for the server.

It is important to stress that in a homogeneous system, although all entities have the same behavioral description (software), they do not have to act in the same way;


their difference will depend solely on the initial value of their input registers. An analogy is the legal system in democratic countries: the law (the set of rules) is the same for every citizen (entity); still, if you are in the police force, while on duty, you are allowed to perform actions that are unlawful for most of the other citizens. An important consequence of the homogeneous behavior property is that we can concentrate solely on environments where all the entities have the same behavior. From now on, when we mention behavior we will always mean homogeneous collective behavior.
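The rule format Status × Event −→ Action and the construction behind Property 1.1.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: actions are represented by descriptive strings, the status and event names are invented for the example, and `is_complete` and `homogenize` are hypothetical helper names, not part of the book's notation.

```python
from itertools import product

NIL = "nil"  # the null action: the entity does not react to the event


def is_complete(behavior, statuses, events):
    """True if every (status, event) pair enables a rule.

    A dict holds one value per key, so the enabled rule is also unique.
    """
    return all((s, e) in behavior for s, e in product(statuses, events))


def homogenize(behaviors, statuses, events):
    """Merge role-specific rule tables into one table shared by all entities.

    behaviors maps a role name to that role's rules {(status, event): action}.
    Each new action inspects the input register my_role and picks the original
    action of that role, or nil if the role had no rule for the pair.
    """
    def make_action(s, e):
        def action(my_role):
            return behaviors[my_role].get((s, e), NIL)
        return action

    return {(s, e): make_action(s, e) for s, e in product(statuses, events)}


# Hypothetical statuses, events, and rules, invented for this sketch.
STATUSES = ["Idle", "Serving"]
EVENTS = ["receive", "alarm", "impulse"]
workstation = {("Idle", "impulse"): "send request",
               ("Idle", "receive"): "show reply"}
server = {("Idle", "receive"): "start job",
          ("Serving", "alarm"): "send reply"}

B = homogenize({"workstation": workstation, "server": server},
               STATUSES, EVENTS)
print(is_complete(B, STATUSES, EVENTS))
print(B[("Idle", "receive")]("server"))
print(B[("Serving", "alarm")]("workstation"))
```

Both kinds of entity now carry the same table B; they act differently only because the value of `my_role` differs, which mirrors the role of the input register in the example above.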

1.2 COMMUNICATION

In a distributed computing environment, entities communicate by transmitting and receiving messages. The message is the unit of communication of a distributed environment. In its most general definition, a message is just a finite sequence of bits. An entity communicates by transmitting messages to and receiving messages from other entities. The set of entities with which an entity can communicate directly is not necessarily E; in other words, it is possible that an entity can communicate directly only with a subset of the other entities. We denote by $N_{out}(x) \subseteq E$ the set of entities to which x can transmit a message directly; we shall call them the out-neighbors of x. Similarly, we denote by $N_{in}(x) \subseteq E$ the set of entities from which x can receive a message directly; we shall call them the in-neighbors of x.

The neighborhood relationship defines a directed graph $\vec{G} = (V, \vec{E})$, where V is the set of vertices and $\vec{E} \subseteq V \times V$ is the set of edges; the vertices correspond to entities, and $(x, y) \in \vec{E}$ if and only if the entity (corresponding to) y is an out-neighbor of the entity (corresponding to) x.

The directed graph $\vec{G} = (V, \vec{E})$ describes the communication topology of the environment. We shall denote by $n(\vec{G})$, $m(\vec{G})$, and $d(\vec{G})$ the number of vertices, edges, and the diameter of $\vec{G}$, respectively. When no ambiguity arises, we will omit the reference to $\vec{G}$ and use simply n, m, and d.

In the following and unless ambiguity should arise, the terms vertex, node, site, and entity will be used as having the same meaning; analogously, the terms edge, arc, and link will be used interchangeably.

In summary, an entity can only receive messages from its in-neighbors and send messages to its out-neighbors. Messages received at an entity are processed there in the order they arrive; if more than one message arrives at the same time, they will be processed in arbitrary order (see Section 1.9). Entities and communication may fail.
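As a small illustration of these definitions, the sketch below stores a hypothetical three-entity topology as a map from each entity to its out-neighbors and derives the in-neighbors, n, m, and d from it. The topology and function names are invented for the example, and the diameter is taken, as an assumption, to be the largest hop distance over all ordered pairs of vertices.

```python
from collections import deque

# Hypothetical topology: out_neighbors[x] plays the role of N_out(x).
out_neighbors = {"x": {"y"}, "y": {"z"}, "z": {"x", "y"}}


def in_neighbors(out_nb):
    """Derive N_in from N_out: x is an in-neighbor of y iff (x, y) is an edge."""
    n_in = {v: set() for v in out_nb}
    for x, outs in out_nb.items():
        for y in outs:
            n_in[y].add(x)
    return n_in


def distances_from(out_nb, source):
    """Breadth-first search: hop distance from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in out_nb[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(out_nb):
    """Largest distance over ordered pairs; None if some vertex is unreachable."""
    best = 0
    for s in out_nb:
        dist = distances_from(out_nb, s)
        if len(dist) < len(out_nb):
            return None
        best = max(best, max(dist.values()))
    return best


n = len(out_neighbors)                                 # number of vertices
m = sum(len(outs) for outs in out_neighbors.values())  # number of directed edges
d = diameter(out_neighbors)
print(n, m, d)
print(in_neighbors(out_neighbors))
```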

1.3 AXIOMS AND RESTRICTIONS

The definition of distributed computing environment with point-to-point communication has two basic axioms, one on communication delay, and the other on the local orientation of the entities in the system.


Any additional assumption (e.g., property of the network, a priori knowledge by the entities) will be called a restriction.

1.3.1 Axioms

Communication Delays

Communication of a message involves many activities: preparation, transmission, reception, and processing. In real systems described by our model, the time required by these activities is unpredictable. For example, in a communication network a message will be subject to queueing and processing delays, which change depending on the network traffic at that time; consider, for instance, the delay in accessing (i.e., sending a message to and getting a reply from) a popular web site. The totality of delays encountered by a message will be called the communication delay of that message.

Axiom 1.3.1 Finite Communication Delays
In the absence of failures, communication delays are finite.

In other words, in the absence of failures, a message sent to an out-neighbor will eventually arrive intact and be processed there. Note that the Finite Communication Delays axiom does not imply the existence of any bound on transmission, queueing, or processing delays; it only states that in the absence of failure, a message will arrive after a finite amount of time without corruption.

Local Orientation

An entity can communicate directly with a subset of the other entities: its neighbors. The only other axiom in the model is that an entity can distinguish between its neighbors.

Axiom 1.3.2 Local Orientation
An entity can distinguish among its in-neighbors. An entity can distinguish among its out-neighbors.

In particular, an entity is capable of sending a message only to a specific out-neighbor (without having to send it also to all other out-neighbors). Also, when processing a message (i.e., executing the rule enabled by the reception of that message), an entity can distinguish which of its in-neighbors sent that message.
In other words, each entity x has a local function $l_x$ associating labels, also called port numbers, to its incident links (or ports), and this function is injective. We denote port numbers by $l_x(x, y)$, the label associated by x to the link (x, y). Let us stress that this label is local to x and in general has no relationship at all with what y might call this link (or x, or itself). Note that for each edge $(x, y) \in \vec{E}$, there are two labels: $l_x(x, y)$ local to x and $l_y(x, y)$ local to y (see Figure 1.1).

Because of this axiom, we will always deal with edge-labeled graphs $(\vec{G}, l)$, where $l = \{l_x : x \in V\}$ is the set of these injective labelings.
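A minimal sketch of local orientation, assuming a hypothetical three-entity network: each entity keeps its own port labels for outgoing and for incoming links, and the axiom is checked by testing that no entity reuses a label among its out-links or among its in-links. The dictionaries and function names are illustrative assumptions only.

```python
def is_injective(labeling):
    """True if no two links of the same entity share a label."""
    labels = list(labeling.values())
    return len(labels) == len(set(labels))


def has_local_orientation(out_labels, in_labels):
    """Check that every entity can tell its out-neighbors apart and its
    in-neighbors apart, using only its own local labels."""
    return (all(is_injective(l) for l in out_labels.values())
            and all(is_injective(l) for l in in_labels.values()))


def neighbor_on_port(out_labels, x, port):
    """Simulator's view: which out-neighbor sits behind a port of x.
    The entity itself only knows the port label, not the neighbor."""
    for y, label in out_labels[x].items():
        if label == port:
            return y
    return None


# out_labels[x][y] stands for l_x(x, y); in_labels[y][x] stands for l_y(x, y).
out_labels = {"x": {"y": 1, "z": 2}, "y": {"z": 1}, "z": {}}
in_labels = {"x": {}, "y": {"x": "a"}, "z": {"x": "a", "y": "b"}}

print(has_local_orientation(out_labels, in_labels))
print(neighbor_on_port(out_labels, "x", 2))
# The edge (x, y) carries two unrelated labels, one at each endpoint.
print(out_labels["x"]["y"], in_labels["y"]["x"])
```

Note that the label 1 is used both by x and by y for different links; this is allowed, since labels are local to each entity.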

FIGURE 1.1: Every edge has two labels

1.3.2 Restrictions

In general, a distributed computing system might have additional properties or capabilities that can be exploited to solve a problem, to achieve a task, or to provide a service. This can be achieved by using these properties and capabilities in the set of rules. However, any property used in the protocol limits the applicability of the protocol. In other words, any additional property or capability of the system is actually a restriction (or submodel) of the general model.

WARNING. When dealing with (e.g., designing, developing, testing, employing) a distributed computing system or just a protocol, it is crucial and imperative that all restrictions are made explicit. Failure to do so will invalidate the resulting communication software.

The restrictions can be varied in nature and type: they might be related to communication properties, reliability, synchrony, and so forth. In the following, we will discuss some of the most common restrictions.

Communication Restrictions

The first category of restrictions includes those relating to communication among entities.

Queueing Policy

A link (x, y) can be viewed as a channel or a queue (see Section 1.9): x sending a message to y is equivalent to x inserting the message in the channel. In general, all kinds of situations are possible; for example, messages in the channel might overtake each other, and a later message might be received first. Different restrictions on the model will describe different disciplines employed to manage the channel; for example, first-in-first-out (FIFO) queues are characterized by the following restriction.

Message Ordering: In the absence of failure, the messages transmitted by an entity to the same out-neighbor will arrive in the same order they are sent.

Note that Message Ordering does not imply the existence of any ordering for messages transmitted to the same entity from different edges, nor for messages sent by the same entity on different edges.
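The difference between a link under the Message Ordering restriction and a link in the general model can be sketched with two toy channel classes. Both are hypothetical illustrations: `FifoChannel` delivers in sending order, while `UnorderedChannel` picks any pending message, using a seeded random generator as a stand-in for unpredictable overtaking.

```python
import random
from collections import deque


class FifoChannel:
    """A link obeying Message Ordering: delivery follows sending order."""

    def __init__(self):
        self._queue = deque()

    def send(self, msg):
        self._queue.append(msg)

    def deliver(self):
        return self._queue.popleft()


class UnorderedChannel:
    """A link in the general model: any pending message may arrive next."""

    def __init__(self, rng):
        self._pending = []
        self._rng = rng

    def send(self, msg):
        self._pending.append(msg)

    def deliver(self):
        index = self._rng.randrange(len(self._pending))
        return self._pending.pop(index)


def transmit_all(channel, messages):
    """Send every message, then deliver them all; return the arrival order."""
    for msg in messages:
        channel.send(msg)
    return [channel.deliver() for _ in messages]


sent = ["m1", "m2", "m3", "m4"]
fifo_arrivals = transmit_all(FifoChannel(), sent)
loose_arrivals = transmit_all(UnorderedChannel(random.Random(7)), sent)
print(fifo_arrivals)   # same order as sent
print(loose_arrivals)  # same messages, order not guaranteed
```

In both cases every message arrives, as required by the Finite Communication Delays axiom; only the FIFO channel additionally guarantees the order.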
Link Property

Entities in a communication system are connected by physical links, which may be very different in capabilities. Examples are simplex and full-duplex links. With a full-duplex line it is possible to transmit in both directions. Simplex lines are already defined within the general model. A duplex line can obviously be described as two simplex lines, one in each direction; thus, a system where all lines are full-duplex can be described by the following restriction:

Reciprocal communication: ∀x ∈ E, $N_{in}(x) = N_{out}(x)$. In other words, if $(x, y) \in \vec{E}$ then also $(y, x) \in \vec{E}$.

Notice that, however, $(x, y) \neq (y, x)$, and in general $l_x(x, y) \neq l_x(y, x)$; furthermore, x might not know that these two links are connections to and from the same entity. A system with full-duplex links that offers such knowledge is defined by the following restriction.

Bidirectional links: ∀x ∈ E, $N_{in}(x) = N_{out}(x)$ and $l_x(x, y) = l_x(y, x)$.
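The two restrictions can be told apart with a short check, sketched here on a hypothetical two-entity network. The representation is an assumption made for the example: `out_labels[x][y]` stands for $l_x(x, y)$ and `in_labels[x][y]` stands for $l_x(y, x)$.

```python
def reciprocal_communication(out_labels, in_labels):
    """N_in(x) = N_out(x) for every entity x (labels are ignored)."""
    return all(set(out_labels[x]) == set(in_labels[x]) for x in out_labels)


def bidirectional_links(out_labels, in_labels):
    """Reciprocal communication, and x uses the same label for (x, y)
    and (y, x), so it knows both links lead to the same entity."""
    return all(out_labels[x] == in_labels[x] for x in out_labels)


# Both directions exist between x and y, but x labels them differently.
out_a = {"x": {"y": "a"}, "y": {"x": "b"}}
in_a = {"x": {"y": "c"}, "y": {"x": "b"}}

# Same topology, with matching labels at each entity.
out_b = {"x": {"y": "a"}, "y": {"x": "b"}}
in_b = {"x": {"y": "a"}, "y": {"x": "b"}}

print(reciprocal_communication(out_a, in_a), bidirectional_links(out_a, in_a))
print(reciprocal_communication(out_b, in_b), bidirectional_links(out_b, in_b))
```

The first network satisfies only Reciprocal communication; the second satisfies Bidirectional links as well.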

IMPORTANT. The case of Bidirectional Links is special. If it holds, we use a simplified terminology. The network is viewed as an undirected graph G = (V, E) (i.e., ∀x, y ∈ E, (x, y) = (y, x)), and the set $N(x) = N_{in}(x) = N_{out}(x)$ will just be called the set of neighbors of x. Note that in this case, $m(\vec{G}) = |\vec{E}| = 2|E| = 2m(G)$.

For example, in Figure 1.2 a graph $\vec{G}$ is depicted where the Bidirectional Links restriction holds, together with the corresponding undirected graph G.

Reliability Restrictions

Other types of restrictions are those related to reliability, faults, and their detection.

FIGURE 1.2: In a network with Bidirectional Links we consider the corresponding undirected graph. (Panels: the directed graph $\vec{G} = (V, \vec{E})$ and the undirected graph G = (V, E).)


Detection of Faults

Some systems might provide a reliable fault-detection mechanism. Following are two restrictions that describe systems that offer such capabilities in regard to component failures:

Edge failure detection: $\forall (x, y) \in \vec{E}$, both x and y will detect whether (x, y) has failed and, following its failure, whether it has been reactivated.

Entity failure detection: ∀x ∈ V, all in- and out-neighbors of x can detect whether x has failed and, following its failure, whether it has recovered.

Restricted Types of Faults

In some systems only some types of failures can occur: for example, messages can be lost but not corrupted. Each situation will give rise to a corresponding restriction. More general restrictions will describe systems or situations where there will be no failures:

Guaranteed delivery: Any message that is sent will be received with its content uncorrupted.

Under this restriction, protocols do not need to take into account omissions or corruptions of messages during transmission. Even more general is the following:

Partial reliability: No failures will occur.

Under this restriction, protocols do not need to take failures into account. Note that under Partial Reliability, failures might have occurred before the execution of a computation. A totally fault-free system is defined by the following restriction.

Total reliability: Neither have any failures occurred nor will they occur.

Clearly, protocols developed under this restriction are not guaranteed to work correctly if faults occur.

Topological Restrictions

In general, an entity is not directly connected to all other entities; it might still be able to communicate information to a remote entity, using others as relays. A system that provides this capability for all entities is characterized by the following restriction:

Connectivity: The communication topology $\vec{G}$ is strongly connected.

That is, from every vertex in $\vec{G}$ it is possible to reach every other vertex. In case the restriction “Bidirectional Links” holds as well, connectedness will simply state that G is connected.
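Whether a given topology satisfies the Connectivity restriction can be tested with two graph searches, as in the sketch below. It relies on a standard fact rather than on anything specific to this text: a directed graph is strongly connected exactly when one vertex reaches every vertex and is reached by every vertex. The example topologies and function names are hypothetical.

```python
from collections import deque


def reachable(adjacency, source):
    """Set of vertices reachable from source by following edges."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def is_strongly_connected(out_neighbors):
    """Check Connectivity; every vertex must appear as a key of out_neighbors."""
    if not out_neighbors:
        return True
    # Reverse every edge so that a search finds who can reach the start vertex.
    reverse = {v: set() for v in out_neighbors}
    for x, outs in out_neighbors.items():
        for y in outs:
            reverse[y].add(x)
    start = next(iter(out_neighbors))
    everyone = set(out_neighbors)
    return (reachable(out_neighbors, start) == everyone
            and reachable(reverse, start) == everyone)


ring = {"x": {"y"}, "y": {"z"}, "z": {"x"}}     # a unidirectional ring
chain = {"x": {"y"}, "y": {"z"}, "z": set()}    # z cannot reach anyone
print(is_strongly_connected(ring))
print(is_strongly_connected(chain))
```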


Time Restrictions An interesting type of restrictions is the one relating to time. In fact, the general model makes no assumption about delays (except that they are finite).

Bounded communication delays: There exists a constant Δ such that, in the absence of failures, the communication delay of any message on any link is at most Δ.

A special case of bounded delays is the following:

Unitary communication delays: In the absence of failures, the communication delay of any message on any link is one unit of time.

The general model also makes no assumptions about the local clocks.

Synchronized clocks: All local clocks are incremented by one unit simultaneously and the interval of time between successive increments is constant.

1.4 COST AND COMPLEXITY The computing environment we are considering is defined at an abstract level. It models rather different systems (e.g., communication networks, distributed systems, data networks, etc.), whose performance is determined by very distinctive factors and costs. The efficiency of a protocol in the model must somehow reflect the realistic costs encountered when executed in those very different systems. In other words, we need abstract cost measures that are general enough but still meaningful. We will use two types of measures: the amount of communication activities and the time required by the execution of a computation. They can be seen as measuring costs from the system point of view (how much traffic will this computation generate and how busy will the system be?) and from the user point of view (how long will it take before I get the results of the computation?).

1.4.1 Amount of Communication Activities The transmission of a message through an out-port (i.e., to an out-neighbor) is the basic communication activity in the system; note that the transmission of a message that will not be received because of failure still constitutes a communication activity.
Thus, to measure the amount of communication activities, the most common function used is the number of message transmissions M, also called message cost. So in general, given a protocol, we will measure its communication costs in terms of the number of transmitted messages. Other functions of interest are the entity workload Lnode = M/|V |, that is, the number of messages per entity, and the transmission load Llink = M/|E|, that is, the number of messages per link.
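The two load functions just defined can be illustrated with a minimal sketch. The function name and the sample figures below are hypothetical and chosen only for illustration; the formulas are the ones given above, Lnode = M/|V| and Llink = M/|E|.

```python
def communication_loads(num_messages, num_entities, num_links):
    """Return (L_node, L_link) = (M / |V|, M / |E|) for a message cost M."""
    if num_entities <= 0 or num_links <= 0:
        raise ValueError("the system must have at least one entity and one link")
    entity_workload = num_messages / num_entities    # messages per entity
    transmission_load = num_messages / num_links     # messages per link
    return entity_workload, transmission_load


# Hypothetical figures: M = 12 transmissions, |V| = 4 entities, |E| = 6 links.
l_node, l_link = communication_loads(12, 4, 6)
print(l_node, l_link)  # 3.0 2.0
```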


Messages are sequences of bits; some protocols might employ messages that are very short (e.g., O(1) bit signals), others very long (e.g., .gif ﬁles). Thus, for a more accurate assessment of a protocol, or to compare different solutions to the same problem that use different sizes of messages, it might be necessary to use as a cost measure the number of transmitted bits B also called bit complexity. In this case, we may sometimes consider the bit-deﬁned load functions: the entity bit-workload Lbnode = B/|V |, that is, the number of bits per entity, and the transmission bit-load Lblink = B/|E|, that is, the number of bits per link. 1.4.2 Time An important measure of efﬁciency and complexity is the total execution delay, that is, the delay between the time the ﬁrst entity starts the execution of a computation and the time the last entity terminates its execution. Note that “time” is here intended as the one measured by an observer external to the system and will also be called real or physical time. In the general model there is no assumption about time except that communication delays for a single message are ﬁnite in absence of failure (Axiom 1.3.1). In other words, communication delays are in general unpredictable. Thus, even in the absence of failures, the total execution delay for a computation is totally unpredictable; furthermore, two distinct executions of the same protocol might experience drastically different delays. In other words, we cannot accurately measure time. We, however, can measure time assuming particular conditions. The measure usually employed is the ideal execution delay or ideal time complexity, T: the execution delay experienced under the restrictions “Unitary Transmission Delays” and “Synchronized Clocks;” that is, when the system is synchronous and (in the absence of failure) takes one unit of time for a message to arrive and to be processed. A very different cost measure is the causal time complexity, Tcausal . 
It is deﬁned as the length of the longest chain of causally related message transmissions, over all possible executions. Causal time is seldom used and is very difﬁcult to measure exactly; we will employ it only once, when dealing with synchronous computations.

1.5 AN EXAMPLE: BROADCASTING Let us clarify the concepts expressed so far by means of an example. Consider a distributed computing system where one entity has some important information unknown to the others and would like to share it with everybody else. This problem is called broadcasting and it is part of a general class of problems called information diffusion. To solve this problem means to design a set of rules that, when executed by the entities, will lead (within finite time) to all entities knowing the information; the solution must work regardless of which entity had the information at the beginning. Let E be the collection of entities and G be the communication topology.


To simplify the discussion, we will make some additional assumptions (i.e., restrictions) on the system:

1. Bidirectional links; that is, we consider the undirected graph G (see Section 1.3.2).
2. Total reliability; that is, we do not have to worry about failures.

Observe that, if G is disconnected, some entities can never receive the information, and the broadcasting problem will be unsolvable. Thus, a restriction that (unlike the previous two) we need to make is as follows:

3. Connectivity; that is, G is connected.

Further observe that, built into the definition of the problem, there is the assumption that only the entity with the initial information will start the broadcast. Thus, a restriction built into the definition is as follows:

4. Unique Initiator; that is, only one entity will start.

A simple strategy for solving the broadcast problem is the following: “if an entity knows the information, it will share it with its neighbors.” To construct the set of rules implementing this strategy, we need to define the set S of status values; from the statement of the problem it is clear that we need to distinguish between the entity that initially has the information and the others: {initiator, idle} ⊆ S. The process can be started only by the initiator; let I denote the information to be broadcasted. Here is the set of rules B(x) (the same for all entities):

1. initiator × ι −→ {send(I) to N(x)}
2. idle × Receiving(I) −→ {Process(I); send(I) to N(x)}
3. initiator × Receiving(I) −→ nil
4. idle × ι −→ nil

where ι denotes the spontaneous impulse event and nil denotes the null action. Because of connectivity and total reliability, every entity will eventually receive the information. Hence, the protocol achieves its goal and solves the broadcasting problem. However, there is a serious problem with these rules: the activities generated by the protocol never terminate. Consider, for example, the simple system with three entities x, y, z connected to each other (see Figure 1.3). Let x be the initiator, y and z be idle, and all messages travel at the same speed; then y and z will be forever sending messages to each other (as well as to x).
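The non-termination just described can be observed with a small simulation. This is only a sketch under stated assumptions: it assumes unit delays (all messages travel at the same speed, as in the example), and the function name and the adjacency-dictionary representation of G are our own choices for illustration.

```python
def naive_broadcast_rounds(neighbors, initiator, rounds):
    """Run the four naive rules in lockstep rounds (unit delays assumed).

    Returns the number of messages sent in each of the first `rounds` rounds.
    """
    status = {x: "idle" for x in neighbors}
    status[initiator] = "initiator"
    # Rule 1: the spontaneous impulse makes the initiator send I to N(x).
    in_transit = [(initiator, y) for y in neighbors[initiator]]
    sent_per_round = [len(in_transit)]
    for _ in range(rounds - 1):
        next_transit = []
        for _sender, receiver in in_transit:
            if status[receiver] == "idle":
                # Rule 2: an idle entity forwards I to all neighbors, every time.
                next_transit.extend((receiver, y) for y in neighbors[receiver])
            # Rule 3: the initiator performs the null action on reception.
        in_transit = next_transit
        sent_per_round.append(len(in_transit))
    return sent_per_round


# Three entities x, y, z connected to each other; x is the initiator.
triangle = {"x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"]}
print(naive_broadcast_rounds(triangle, "x", 6))  # [2, 4, 4, 4, 4, 4]
```

After the first round, y and z keep exchanging four messages per round however many rounds are simulated, so the traffic never dies out.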

FIGURE 1.3: An execution of Flooding.

To avoid this unwelcome effect, an entity should send the information to its neighbors only once: the first time it acquires the information. This can be achieved by introducing a new status done; that is, S = {initiator, idle, done}.

1. initiator × ι −→ {send(I) to N(x); become done}
2. idle × Receiving(I) −→ {Process(I); become done; send(I) to N(x)}
3. initiator × Receiving(I) −→ nil
4. idle × ι −→ nil
5. done × Receiving(I) −→ nil
6. done × ι −→ nil

where become denotes the operation of changing status. This time the communication activities of the protocol terminate: Within finite time all entities become done; since a done entity knows the information, the protocol is correct (see Exercise 1.12.1). Note that depending on transmission delays, different executions are possible; one such execution, in an environment composed of three entities x, y, z connected to each other where x is the initiator, is depicted in Figure 1.3. IMPORTANT. Note that entities terminate their execution of the protocol (i.e., become done) at different times; it is actually possible that an entity has terminated while others have not yet started. This is something very typical of distributed computations: There is a difference between local termination and global termination.


IMPORTANT. Notice also that in this protocol nobody ever knows when the entire process is over. We will examine these issues in detail in other chapters, in particular when discussing the problem of termination detection. The above set of rules correctly solves the problem of broadcasting. Let us now calculate the communication costs of the algorithm. First of all, let us determine the number of message transmissions. Each entity, whether initiator or not, sends the information to all its neighbors. Hence the total number of messages transmitted is exactly

∑_{x ∈ E} |N(x)| = 2|E| = 2m,

where the sum ranges over all entities x and m denotes the number of links.

We can actually reduce the cost. Currently, when an idle entity receives the message, it will broadcast the information to all its neighbors, including the entity from which it had received the information; this is clearly unnecessary. Recall that, by the Local Orientation axiom, an entity can distinguish among its neighbors; in particular, when processing a message, it can identify from which port it was received and avoid sending a message there. The final protocol is as before with only this small modification.

Protocol Flooding

1. initiator × ι −→ {send(I) to N(x); become done}
2. idle × Receiving(I) −→ {Process(I); become done; send(I) to N(x) − sender}
3. initiator × Receiving(I) −→ nil
4. idle × ι −→ nil
5. done × Receiving(I) −→ nil
6. done × ι −→ nil

where sender is the neighbor that sent the message currently being processed. This algorithm is called Flooding as the entire system is “flooded” with the message during its execution, and it is a basic algorithmic tool for distributed computing. As for the number of message transmissions required by flooding, because we avoid transmitting some messages, we know that it is less than 2m; in fact (Exercise 1.12.2):

M[Flooding] = 2m − n + 1.    (1.1)
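Equation (1.1) can be checked experimentally on small graphs. The following is a minimal sketch, not a definitive implementation: it assumes a connected undirected graph given as an adjacency dictionary, it models unpredictable delays by delivering pending messages in a pseudo-random order, and all names (flooding, graph, seed) are our own.

```python
import random


def flooding(neighbors, initiator, seed=0):
    """Simulate Protocol Flooding; return (message transmissions, final statuses)."""
    rng = random.Random(seed)
    status = {x: "idle" for x in neighbors}
    # Rule 1: the initiator sends I to all its neighbors and becomes done.
    in_transit = [(initiator, y) for y in neighbors[initiator]]
    status[initiator] = "done"
    transmissions = len(in_transit)
    while in_transit:
        # Deliver an arbitrary pending message: delays are unpredictable.
        sender, receiver = in_transit.pop(rng.randrange(len(in_transit)))
        if status[receiver] == "idle":
            # Rule 2: forward I to every neighbor except the sender.
            forwards = [(receiver, y) for y in neighbors[receiver] if y != sender]
            transmissions += len(forwards)
            in_transit.extend(forwards)
            status[receiver] = "done"
        # Rule 5: a done entity performs the null action on reception.
    return transmissions, status


graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
n = len(graph)
m = sum(len(nb) for nb in graph.values()) // 2
count, final_status = flooding(graph, "a")
print(count, 2 * m - n + 1)  # 5 5
```

The count does not depend on the delivery order: the initiator sends |N(x)| messages and every other entity sends |N(x)| − 1, which sums to 2m − (n − 1).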

Let us examine now the ideal time complexity of flooding. Let d(x, y) denote the distance (i.e., the length of the shortest path) between x and y in G. Clearly the message sent by the initiator has to reach every entity in the system, including the furthermost one from the initiator. So, if x is the initiator, the ideal time complexity will be r(x) = Max {d(x, y) : y ∈ E}, which is called the eccentricity (or radius) of x. In other words, the total time depends on which entity is the initiator and thus cannot be known precisely beforehand. We can, however, determine exactly the ideal time complexity in the worst case. Since any entity could be the initiator, the ideal time complexity in the worst case will be d(G) = Max {r(x) : x ∈ E}, which is the diameter of G. In other words, the ideal time complexity will be at most the diameter of G:

T[Flooding] ≤ d(G).    (1.2)
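The quantities r(x) and d(G) are easy to compute for a given graph by breadth-first search. The sketch below assumes a connected undirected graph stored as an adjacency dictionary; the function names and the sample path graph are illustrative only.

```python
from collections import deque


def eccentricity(neighbors, x):
    """r(x) = max over y of d(x, y), computed by breadth-first search from x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(neighbors):
        raise ValueError("G must be connected")
    return max(dist.values())


def diameter(neighbors):
    """d(G) = max over x of r(x)."""
    return max(eccentricity(neighbors, x) for x in neighbors)


# A path a - b - c - d: flooding from b takes 2 ideal time units, at most 3 from anywhere.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(eccentricity(path, "b"), diameter(path))  # 2 3
```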

1.6 STATES AND EVENTS Once we have defined the behavior of the entities, their communication topology, and the set of restrictions under which they operate, we must describe the initial conditions of our environment. This is done first of all by specifying the initial condition of all the entities. The initial content of all the registers of entity x and the initial value of its alarm clock cx at time t = 0 constitute the initial internal state σ(x, 0) of x. Let Σ(0) = {σ(x, 0) : x ∈ E} denote the set of all the initial internal states. Once Σ(0) is defined, we have completed the static specification of the environment: the description of the system before any event occurs and before any activity takes place. We are, however, also interested in describing the system during the computational activities, as well as after such activities. To do so, we need to be able to describe the changes that the system undergoes over time. As mentioned before, the entities (and, thus the environments) are reactive. That is, any activity of the system is determined entirely by the external events. Let us examine these facts in more detail.

1.6.1 Time and Events In distributed computing environments, there are only three types of external events: spontaneous impulse (spontaneously), reception of a message (receiving), and alarm clock ring (when). When an external event occurs at an entity, it triggers the execution of an action (the nature of the action depends on the status of the entity when the event occurs). The executed action may generate new events: The operation send will generate a receiving event, and the operation set alarm will generate a when event. Note first of all that the events so generated might not occur at all. For example, a link failure may destroy the traveling message, destroying the corresponding receiving event; in a subsequent action, an entity may turn off the previously set alarm, destroying the when event.
Notice now that if they occur, these events will do so at a later time (i.e., when the message arrives, when the alarm goes off). This delay might be known precisely in the case of the alarm clock (because it is set by the entity); it is, however, unpredictable in the case of message transmission (because it is due to the conditions external to the entity). Different delays give rise to different executions of the same protocols with possibly different outcomes.


Summarizing, each event e is “generated” at some time t(e) and, if it occurs, it will happen at some time later. By definition, all spontaneous impulses are already generated before the execution starts; their set will be called the set of initial events. The execution of the protocol starts when the first spontaneous impulses actually happen; by convention, this will be time t = 0. IMPORTANT. Notice that “time” is here considered as seen by an external observer and is viewed as real time. Each real time instant t separates the axis of time into three parts: past (i.e., {t′ < t}), present (i.e., t), and future (i.e., {t′ > t}). All events generated before t that will happen after t are called the future at t and denoted by Future(t); it represents the set of future events determined by the execution so far. An execution is fully described by the sequence of events that have occurred. For small systems, an execution can be visualized by what is called a Time × Event Diagram (TED). Such a diagram is composed of temporal lines, one for each entity in the system. Each event is represented in such a diagram as follows:

A Receiving event r is represented as an arrow from the point tx(r) in the temporal line of the entity x generating r (i.e., sending the message) to the point ty(r) in the temporal line of the entity y where the event occurs (i.e., receiving the message).

A When event w is represented as an arrow from point t′x(w) to point tx(w) in the temporal line of the entity setting the clock.

A Spontaneously event ι is represented as a short arrow indicating point tx(ι) in the temporal line of the entity x where the event occurs.

For example, Figure 1.4 depicts the TED corresponding to the execution of Protocol Flooding shown in Figure 1.3.

FIGURE 1.4: Time × Event Diagram (one temporal line for each of the entities x, y, and z).


1.6.2 States and Configurations The private memory of each entity, in addition to the behavior, contains a set of registers, some of them already initialized, others to be initialized during the execution. The content of all the registers of entity x and the value of its alarm clock cx at time t constitute what is called the internal state of x at t and is denoted by σ(x, t). We denote by Σ(t) the set of the internal states at time t of all entities. Internal states change with time and the occurrence of events. There is an important fact about internal states. Consider two different environments, E1 and E2, where, by accident, the internal state of x at time t is the same. Then x cannot distinguish between the two environments, that is, x is unable to tell whether it is in environment E1 or E2. There is an important consequence. Consider the situation just described: At time t, the internal state of x is the same in both E1 and E2. Assume now that also by accident, exactly the same event occurs at x (e.g., the alarm clock rings or the same message is received from the same neighbor). Then x will perform exactly the same action in both cases, and its internal state will continue to be the same in both situations.

Property 1.6.1 Let the same event occur at x at time t in two different executions, and let σ1 and σ2 be its internal states when this happens. If σ1 = σ2, then the new internal state of x will be the same in both executions.

Similarly, if two entities have the same internal state, they cannot distinguish between each other. Furthermore, if by accident, exactly the same event occurs at both of them (e.g., the alarm clock rings or the same message is received from the same neighbor), then they will perform exactly the same action in both cases, and their internal state will continue to be the same in both situations.

Property 1.6.2 Let the same event occur at x and y at time t, and let σ1 and σ2 be their internal states, respectively, at that time. If σ1 = σ2, then the new internal state of x and y will be the same.

Remember: Internal states are local and an entity might not be able to infer from them information about the status of the rest of the system. We have talked about the internal state of an entity, initially (i.e., at time t = 0) and during an execution. Let us now focus on the state of the entire system during an execution. To describe the global state of the environment at time t, we obviously need to specify the internal state of all entities at that time; that is, the set Σ(t). However, this is not enough. In fact, the execution so far might have already generated some events that will occur after time t; these events, represented by the set Future(t), are an integral part of this execution and must be specified as well. Specifically, the global state, called configuration, of the system during an execution is specified by the couple

C(t) = ⟨Σ(t), Future(t)⟩.


The initial configuration C(0) contains not only the initial set of states Σ(0) but also the set Future(0) of the spontaneous impulses. Environments that differ only in their initial configuration will be called instances of the same system. The configuration C(t) is like a snapshot of the system at time t.

1.7 PROBLEMS AND SOLUTIONS The topic of this book is how to design distributed algorithms and analyze their complexity. A distributed algorithm is the set of rules that will regulate the behaviors of the entities. The reason why we may need to design the behaviors is to enable the entities to solve a given problem, perform a defined task, or provide a requested service. In general, we will be given a problem, and our task is to design a set of rules that will always solve the problem in finite time. Let us discuss these concepts in some detail.

Problems To give a problem (or task, or service) P means to give a description of what the entities must accomplish. This is done by stating what the initial conditions of the entities are (and thus of the system), and what the final conditions should be; it should also specify all given restrictions. In other words, P = ⟨PINIT, PFINAL, R⟩, where PINIT and PFINAL are predicates on the values of the registers of the entities, and R is a set of restrictions. Let wt(x) denote the value of an input register w(x) at time t and {wt} = {wt(x) : x ∈ E} the values of this register at all entities at that time. So, for example, {status0} represents the initial value of the status registers of the entities. For example, in the problem Broadcasting(I) described in Section 1.5, the initial and final conditions are given by the predicates

PINIT(t) ≡ “only one entity has the information at time t” ≡ ∃x ∈ E (valuet(x) = I ∧ ∀y ≠ x (valuet(y) = ø)),

PFINAL(t) ≡ “every entity has the information at time t” ≡ ∀x ∈ E (valuet(x) = I).

The restrictions we have imposed on our solution are BL (Bidirectional Links), TR (Total Reliability), and CN (Connectivity). Implicit in the problem definition there is also the condition that only the entity with the information will start the execution of the solution protocol; denote by UI the predicate describing this restriction, called Unique Initiator.
Summarizing, for Broadcasting, the set of restrictions we have made is {BL, TR, CN, UI}.


Status A solution protocol B for P = ⟨PINIT, PFINAL, R⟩ will specify how the entities will accomplish the required task. Part of the design of the set of rules B(x) is the definition of the set of status values S, that is, the values that can be held by the status register status(x). We call initial status values those values of S that can be held at the start of the execution of B(x) and we shall denote their set by SINIT. By contrast, terminal status values are those values that once reached, cannot ever be changed by the protocol; their set shall be denoted by STERM. All other values in S will be called intermediate status values. For example, in the protocol Flooding described in Section 1.5, SINIT = {initiator, idle} and STERM = {done}. Depending on the restrictions of the problem, only entities in specific initial status values will start the protocol; we shall denote by SSTART ⊆ SINIT the set of those status values. Typically, SSTART consists of only one status; for example, in Flooding, SSTART = {initiator}. It is possible to rewrite a protocol so that this is always the case (see Exercise 1.12.5). Among terminal status values we shall distinguish those in which no further activity can take place; that is, those where the only action is nil. We shall call such status values final and we shall denote by SFINAL ⊆ STERM the set of those status values. For example, in Flooding, SFINAL = {done}.

Termination Protocol B terminates if, for all initial configurations C(0) satisfying PINIT, and for all executions starting from those configurations, the predicate Terminate(t) ≡ ({statust} ⊆ STERM) ∧ (Future(t) = ∅) holds for some t > 0, that is, all entities enter a terminal status after a finite time and all generated events have occurred. We have already remarked on the fact that entities might not be aware that the termination has occurred. In general, we would like each entity to know at least of its termination. This situation, called explicit termination, is said to occur if the predicate Explicit-Terminate(t) ≡ ({statust} ⊆ SFINAL) holds for some t > 0, that is, all entities enter a final status after a finite time.

Correctness Protocol B is correct if, for all executions starting from initial configurations satisfying PINIT, ∃t > 0 : Correct(t) holds, where Correct(t) ≡ (∀t′ ≥ t, PFINAL(t′)); that is, the final predicate eventually holds and does not change.


Solution Protocol The set of rules B solves problem P if it always correctly terminates under the problem restrictions R. As there are two types of termination (simple and explicit), we will have two types of solutions: Simple Solution[B,P] where the predicate ∃t > 0 (Correct(t)∧ Terminate(t)) holds, under the problem restrictions R, for all executions starting from initial conﬁgurations satisfying PINIT ; and Explicit Solution[B,P] where the predicate ∃t > 0 (Correct(t)∧ Explicit-Terminate(t)) holds, under the problem restrictions R, for all executions starting from initial conﬁgurations satisfying PINIT .
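To make the predicates concrete, here is a minimal sketch that evaluates Terminate(t), Explicit-Terminate(t), and PFINAL(t) on a single snapshot of Flooding. The dictionaries standing in for the status and value registers, and the list standing in for Future(t), are hypothetical representations chosen for illustration. Note that Correct(t) quantifies over all later times and so cannot be decided from one snapshot; only PFINAL at the given instant is checked here.

```python
S_TERM = {"done"}    # terminal status values of Flooding
S_FINAL = {"done"}   # final status values of Flooding


def terminate(statuses, future):
    """Terminate(t): every status is terminal and Future(t) is empty."""
    return set(statuses.values()) <= S_TERM and len(future) == 0


def explicit_terminate(statuses):
    """Explicit-Terminate(t): every status is final."""
    return set(statuses.values()) <= S_FINAL


def p_final(values, info):
    """P_FINAL(t) for Broadcasting: every entity holds the information."""
    return all(v == info for v in values.values())


# Hypothetical snapshot: all entities are done, but one message is still traveling.
statuses = {"x": "done", "y": "done", "z": "done"}
values = {"x": "I", "y": "I", "z": "I"}
pending = [("receiving", "y", "z")]

print(terminate(statuses, pending))   # False: Future(t) is not empty
print(terminate(statuses, []))        # True
print(p_final(values, "I"))           # True
```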

1.8 KNOWLEDGE The notions of information and knowledge are fundamental in distributed computing. Informally, any distributed computation can be viewed as the process of acquiring information through communication activities; conversely, the reception of a message can be viewed as the process of transforming the state of knowledge of the processor receiving the message.

1.8.1 Levels of Knowledge The content of the local memory of an entity and the information that can be derived from it constitute the local knowledge of an entity. We denote by p ∈ LKt[x] the fact that p is local knowledge at x at the global time instant t. By definition, lx ∈ LKt[x] for all t, that is, the (labels of the) in- and out-edges of x are time-invariant local knowledge of x. Sometimes it is necessary to describe knowledge held by more than one entity at a given time. Information p is said to be implicit knowledge in W ⊆ E at time t, denoted by p ∈ IKt[W], if at least one entity in W knows p at time t, that is,

p ∈ IKt[W] iff ∃x ∈ W (p ∈ LKt[x]).

A stronger level of knowledge in a group W of entities is held when, at a given time t, p is known to every entity in the group, denoted by p ∈ EKt[W], that is,

p ∈ EKt[W] iff ∀x ∈ W (p ∈ LKt[x]).


In this case, p is said to be explicit knowledge in W ⊆ E at time t. Consider for example broadcasting discussed in the previous section. Initially, at time t = 0, only the initiator s knows the information I; in other words, I ∈ LK0[s]. Thus, at that time, I is implicitly known to all entities, that is, I ∈ IK0[E]. At the end of the broadcast, at time t′, every entity will know the information; in other words, I ∈ EKt′[E]. Notice that, in the absence of failures, knowledge cannot be lost, only gained, that is, for all t′ > t and all W ⊆ E, if no failure occurs, IKt[W] ⊆ IKt′[W] and EKt[W] ⊆ EKt′[W]. Assume that a fact p is explicit knowledge in W at time t. It is possible that some (maybe all) entities are not aware of this situation. For example, assume that at time t, entities x and y know the value of a variable of z, say its ID; then the ID of z is explicit knowledge in W = {x, y, z}; however, z might not be aware that x and y know its ID. In other words, when p ∈ EKt[W], the fact “p ∈ EKt[W]” might not be even locally known to any of the entities in W. This gives rise to the highest level of knowledge within a group: common knowledge. Information p is said to be common knowledge in W ⊆ E at time t, denoted by p ∈ CKt[W], if and only if at time t every entity in W knows p, and knows that every entity in W knows p, and knows that every entity in W knows that every entity in W knows p, and . . . , etcetera, that is,

p ∈ CKt[W] iff ⋀_{1≤i≤∞} Pi,

where the Pi’s are the predicates defined by: P1 = [p ∈ EKt[W]] and Pi+1 = [Pi ∈ EKt[W]]. In most distributed problems, it will be necessary for the entities to achieve common knowledge. Fortunately, we do not always have to go to ∞ to reach common knowledge, and a finite number of steps might actually do, as indicated by the following example.

Example (muddy forehead): Imagine n perceptive and intelligent school children playing together during recess. They are forbidden to play in the mud puddles, and the teacher has told them that if they do, there will be severe consequences. Each child wants to keep clean, but the temptation to play with mud is too great to resist. As a result, k of the children get mud on their foreheads. When the teacher arrives, she says, “I see that some of you have been playing in the mud puddle: the mud on your foreheads is a dead giveaway!” and then continues, “The guilty ones who come forward spontaneously will be given a small penalty; those who do not, will receive a punishment they will not easily forget.” She then adds, “I am going to leave the room now, and I will return periodically; if you decide to confess, you must all come forward together when I am in the room. In the meanwhile, everybody must sit absolutely still and without talking.” Each child in the room clearly understands that those with mud on their foreheads are “dead meat,” who will be punished no matter what. Obviously, the children do


not want to confess if the foreheads are clean, and clearly, if the foreheads are dirty, they want to go forward so as to avoid the terrible punishment for those who do not confess. As each child shares the same concern, the collective goal is for the children with clean foreheads not to confess and for those with muddy foreheads to go forward simultaneously, and all of this without communication. Let us examine this goal. The first question is as follows: can a child x find out whether his/her forehead is dirty or not? She/he can see how many, say fx, of the other children are dirty; thus, the question is if x can determine whether k = fx or k = fx + 1. The second, more complex question is as follows: can all the children with mud on their foreheads find out at the same time so that they can go forward together? In other words, can the exact value of k become common knowledge? The children, being perceptive and intelligent, determine that the answer to both the questions is positive and find the way to achieve the common goal and thus common knowledge without communication (Exercise 1.12.6). IMPORTANT. When working in a submodel, all the restrictions defining the submodel are common knowledge to all entities (unless otherwise specified).

1.8.2 Types of Knowledge We can have various types of knowledge, such as knowledge about the communication topology, about the labeling of the communication graph, about the input data of the communicating entities. In general, if we have some knowledge of the system, we can exploit it to reduce the cost of a protocol, although this may result in making the applicability of the protocol more limited. A type of knowledge of particular interest is the one regarding the communication topology (i.e., the graph G). In fact, as will be seen later, the complexity of a computation may vary greatly depending on what the entities know about G. Following are some elements that, if they are common knowledge to the entities, may affect the complexity.

1. Metric Information: numeric information about the network; for example, number n = |V| of nodes, number m = |E| of links, diameter, girth, etcetera. This information can be exact or approximate.
2. Topological Properties: knowledge of some properties of the topology; for example, “G is a ring network,” “G does not have cycles,” “G is a Cayley graph,” etcetera.
3. Topological Maps: a map of the neighborhood of the entity up to distance d, a complete “map” of G (e.g., the adjacency matrix of G); a complete “map” of (G, l) (i.e., it contains also the labels), etcetera.

Note that some types of knowledge imply other knowledge; for example, if an entity with k neighbors knows that the network is a complete undirected graph, then it knows that n = k + 1.


As a topological map provides all possible metric and structural information, this type of knowledge is very powerful and important. The strongest form of this type is full topological knowledge: availability at each entity of a labeled graph isomorphic to (G, l), the isomorphism, and its own image, that is, every entity has a complete map of (v, l) with the indication, “You are here.” Another type of knowledge refers to the labeling l. What is very important is whether the labeling has some global consistency property. We can distinguish two other types, depending on whether the knowledge is about the (input) data or the status of the entities and of the system, and we shall call them type-D and type-S, respectively. Examples of type-D knowledge are the following: Unique identifiers: all input values are distinct; Multiset: input values are not necessarily identical; Size: number of distinct values. Examples of type-S knowledge are the following: System with leader: there is a unique entity in status “leader”; Reset: all nodes are in the same status; Unique initiator: there is a unique entity in status “initiator.” For example, in the broadcasting problem we discussed in Section 1.5, this knowledge was assumed as a part of the problem definition.

1.9 TECHNICAL CONSIDERATIONS

1.9.1 Messages The content of a message obviously depends on the application; in any case, it consists of a finite (usually bounded) sequence of bits. The message is typically divided into subsequences, called fields, with a predefined meaning (“type”) within the protocol.
Examples of field types are the following: message identifier or header, used to distinguish between different types of messages; originator and destination fields, used to specify the (identity of the) entity originating the message and the entity for which the message is intended; data fields, used to carry information needed in the computation (the nature of the information obviously depends on the particular application under consideration). Thus, in general, a message M will be viewed as a tuple

$M = \langle f_1, f_2, \ldots, f_k \rangle$

where k is a (small) predefined constant, and each $f_i$ (1 ≤ i ≤ k) is a field of a specified type, each type of a fixed length. So, for example, in protocol Flooding, there is only one type of message; it is composed of two fields, $M = \langle f_1, f_2 \rangle$, where $f_1$ is a message identifier (containing the information: “this is a broadcast message”), and $f_2$ is a data field containing the actual information I being broadcast.

If (the limit on) the size of a message is a system parameter (i.e., it does not depend on the particular application), we say that the system has bounded messages. Such is, for example, the limit imposed on the message length in packet-switching networks, as well as on the length of control messages in circuit-switching networks (e.g., telephone networks) and in message-switching networks.
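As a minimal sketch of this view, the single message type of protocol Flooding can be modeled as a fixed tuple of two typed fields. The names `FloodingMessage`, `msg_id`, `data`, and `make_broadcast` are illustrative assumptions and are not notation used in the text.

```python
from typing import NamedTuple


class FloodingMessage(NamedTuple):
    """Hypothetical encoding of M = <f1, f2> for protocol Flooding."""
    msg_id: str   # f1: message identifier ("this is a broadcast message")
    data: bytes   # f2: data field carrying the information I


def make_broadcast(info: bytes) -> FloodingMessage:
    """Build the only kind of message Flooding ever sends."""
    return FloodingMessage(msg_id="BROADCAST", data=info)


m = make_broadcast(b"hello")
f1, f2 = m   # the message unpacks like the tuple <f1, f2>
```

A receiver would inspect the first field to decide which rule applies and read the payload from the second.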


Bounded messages are also called packets and contain at most µ(G) bits, where µ(G) is the system-dependent bound called packet size. Notice that sending a sequence of K bits in G will require the transmission of at least ⌈K/µ(G)⌉ packets.

1.9.2 Protocol Notation

A protocol B(x) is a set of rules. We have already introduced in Section 1.5 most of the notation for describing those rules. Let us now complete the description of the notation we will use for protocols. We will employ the following conventions:
1. Rules will be grouped by status.
2. If the action for a (status,event) pair is nil, then, for simplicity, the corresponding rule will be omitted from the description. As a consequence, if no rule is described for a (status,event) pair, the default will be that the pair enables the Null action.
WARNING. Although convenient (it simplifies the writing), the use of this convention demands extra care in the description: If we forget to write a rule for an event occurring in a given status, it will be assumed that a rule exists and the action is nil.
3. If an action contains a change of status, this operation will be the last one before exiting the action.
4. The set of status values of the protocol and the set of restrictions under which the protocol operates will be explicit.
Using these conventions, the protocol Flooding defined in Section 1.5 will be written as shown in Figure 1.5.

Precedence

The external events are as follows: spontaneous impulse (Spontaneously), reception of a message (Receiving), and alarm clock ring (When). Different types of external events can occur simultaneously; for example, the alarm clock might ring at the same time a message arrives. The simultaneous events will be processed sequentially.
To determine the order in which they will be processed, we will use the following precedence between external events: Spontaneously > When > Receiving; that is, the spontaneous impulse takes precedence over the alarm clock, which has precedence over the arrival of a message. At most one spontaneous impulse can occur at an entity at any one time. As there is locally only one alarm clock, at any time there will be at most one When event. By contrast, it is possible that more than one message arrives at an entity at the same time from different neighbors; should this be the case, these simultaneous Receiving events all have the same precedence and will be processed sequentially in an arbitrary order.

PROTOCOL Flooding.

Status Values: S = {INITIATOR, IDLE, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.

Restrictions: Bidirectional Links, Total Reliability, Connectivity, and Unique Initiator.

INITIATOR
    Spontaneously
    begin
        send(M) to N(x);
        become DONE;
    end

IDLE
    Receiving(M)
    begin
        Process(M);
        send(M) to N(x) − {sender};
        become DONE;
    end

FIGURE 1.5: Flooding Protocol

1.9.3 Communication Mechanism

The communication mechanisms of a distributed computing environment must handle transmissions and arrivals of messages. The mechanisms at an entity can be seen as a system of queues. Each link (x, y) ∈ E corresponds to a queue, with access at x and exit at y; the access is called out-port and the exit is called in-port. Each entity has thus two types of ports: out-ports, one for each out-neighbor (or out-link), and in-ports, one for each in-neighbor (or in-link). At an entity, each out-port has a distinct label (recall the Local Orientation axiom (Axiom 1.3.2)) called port number: the out-port corresponding to (x, y) has label lx(x, y); similarly for the in-ports. The sets N_in and N_out will in practice consist of the port numbers associated with those neighbors; this is because an entity has no other information about its neighbors (unless we add restrictions).

The command “send M to W” will have a copy of the message M sent through each of the out-ports specified by W. When a message M is sent through an out-port l, it is inserted in the corresponding queue. In absence of failures (recall the Finite Communication Delays axiom), the communication mechanism will eventually remove it from the queue and deliver it to the other entity through the corresponding in-port, generating the Receiving(M) event; at that time the variable sender will be set to l.
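The queue view of the communication mechanism can be sketched as follows. This is only an illustration under stated assumptions: the class `Entity`, the helpers `connect` and `deliver_one`, and the first-in-first-out removal policy are all hypothetical choices made here, and the text does not prescribe any of them.

```python
from collections import deque


class Entity:
    """Hypothetical entity whose out-links are modeled as queues keyed by port number."""

    def __init__(self, name):
        self.name = name
        self.out_ports = {}   # port number -> (queue, receiver, receiver's in-port number)
        self.received = []    # log of (sender, message) pairs, one per Receiving event

    def send(self, message, ports):
        # "send M to W": a copy of M is inserted in the queue of each out-port in W
        for port in ports:
            queue, _, _ = self.out_ports[port]
            queue.append(message)

    def receiving(self, message, sender):
        # stand-in for whatever rule the protocol attaches to the Receiving(M) event
        self.received.append((sender, message))


def connect(x, x_port, y, y_port):
    """Create the queue for link (x, y): access at x's out-port, exit at y's in-port."""
    x.out_ports[x_port] = (deque(), y, y_port)


def deliver_one(x, port):
    """Remove the oldest queued message and generate Receiving at the other end."""
    queue, receiver, in_port = x.out_ports[port]
    message = queue.popleft()          # FIFO removal is an assumption of this sketch
    receiver.receiving(message, sender=in_port)


a, b = Entity("a"), Entity("b")
connect(a, 1, b, 7)    # a's out-port 1 leads to b, where the link is in-port 7
a.send("M", [1])
deliver_one(a, 1)      # b now sees Receiving("M") with sender set to port 7
```

Note that the receiver learns only a port number, never the identity of the neighbor, which matches the remark that an entity has no other information about its neighbors.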


1.10 SUMMARY OF DEFINITIONS

Distributed Environment: Collection of communicating computational entities.
Communication: Transmission of messages.
Message: Bounded sequence of bits.
Entity’s Capability: Local processing, local storage, access to a local clock, and communication.
Entity’s Status Register: At any time an entity’s status register has a value from a predefined set of status values.
External Events: Arrival of a message, alarm clock ring, and spontaneous impulse.
Entity’s Behavior: Entities react to external events. The behavior is dictated by a set of rules. Each rule has the form STATUS × EVENT → Action, specifying what the entity has to do if a certain external event occurs when the entity is in a given status. The set of rules must be nonambiguous and complete.
Actions: An action is an indivisible (i.e., uninterruptible) finite sequence of operations (local processing, message transmission, change of status, and setting of alarm clock).
Homogeneous System: A system is homogeneous if all the entities have the same behavior. Every system can be made homogeneous.
Neighbors: The in-neighbors of an entity x are those entities from which x can receive a message directly; the out-neighbors are those to which x can send a message directly.
Communication Topology: The directed graph G = (V, E) defined by the neighborhood relation. If the Bidirectional Links restriction holds, then G is undirected.
Axioms: There are two axioms: local orientation and finite communication delays.
Local Orientation: An entity can distinguish between its out-neighbors and its in-neighbors.
Finite Communication Delays: In absence of failures, a message eventually arrives.
Restriction: Any additional property.

1.11 BIBLIOGRAPHICAL NOTES

Several attempts have been made to derive formalisms capable of describing both distributed systems and computations performed in such systems.
A signiﬁcant amount of study has been devoted to deﬁning formalisms, which would ease the task of formally proving properties of distributed computation (e.g., absence of deadlock, liveness, etc.). The models proposed for systems of concurrent processes do provide both a formalism for describing a distributed computation and a proof system that


can be employed within the formalism; such is, for example, the Unity model of Mani Chandy and Jayadev Misra [1]. Other models, whose intended goal is still to provide a proof system, have been specifically tailored for distributed computations. In particular, the Input–Output Automata model of Nancy Lynch and Mark Tuttle [4] provides a powerful tool that has helped discover and fix “bugs” in well-known existing protocols.

For the investigators involved in the design and analysis of distributed algorithms, the main concern rests with efficiency and complexity; proving correctness of an algorithm is a compulsory task, but it is usually accomplished using traditional mathematical tools (which are generally considered informal techniques) rather than with formal proof systems. The formal models of computation employed in these studies, as well as the one used in this book, mainly focus on those factors that are directly related to efficiency of a distributed computation and complexity of a distributed problem: the underlying communication network, the communication primitives, the amount and type of knowledge available to the processors, etcetera.

Modal logic, and in particular the notion of common knowledge, is a useful tool to reason about distributed computing environments in the presence of failures. The notion of knowledge used here was developed independently by Joseph Halpern and Yoram Moses [2], Daniel J. Lehmann [3], and Stanley Rosenschein [5].

The model we have described and will employ in this book uses reactive entities (they react to external stimuli). Several formal models (including Input–Output Automata) use instead active entities. To understand this fundamental difference, consider a message in transit toward an entity that is expecting it, with no other activity in the system.
In an active model, the entity will attempt to receive the message, even while it is not there; each attempt is an event; hence, this simple situation can actually cause an unpredictable number of events. By contrast, in a reactive model, the entity does nothing; the only event is the arrival of the message that will “wake up” the entity and trigger its response. Using the analogy of waiting for the delivery of a pizza, in the active model, you (the entity) must repeatedly open the door (i.e., act) to see if the person supposed to deliver the pizza has arrived; in the reactive model, you sit in the living room until the bell rings and then go and open the door (i.e., react). The two models are equally powerful; they just represent different ways of looking at and expressing the world. It is our contention that, at least for the description and the complexity analysis of protocols and distributed algorithms, the reactive model is more expressive and simpler to understand, to handle, and to use.

1.12 EXERCISES, PROBLEMS, AND ANSWERS

1.12.1 Exercises and Problems

Exercise 1.12.1 Prove that the flooding technique introduced in Section 1.5 is correct, that is, it terminates within finite time, and all entities will receive the information held by the initiator.


Exercise 1.12.2 Determine the exact number of message transmissions required by the protocol Flooding described in Section 1.5.

Exercise 1.12.3 In Section 1.5 we have solved the broadcasting problem under the restriction of Bidirectional Links. Solve the problem using the Reciprocal Communication restriction instead.

Exercise 1.12.4 In Section 1.5 we have solved the broadcasting problem under the restriction of Bidirectional Links. Solve the problem without this restriction.

Exercise 1.12.5 Show that any protocol B can be rewritten so that S_START consists of only one status. (Hint: Introduce a new input variable.)

Exercise 1.12.6 Consider the muddy children problem discussed in Section 1.8.1. Show that, within finite time, all the children with a muddy forehead can simultaneously determine that they are not clean. (Hint: Use induction on k.)

Exercise 1.12.7 Half-duplex links allow communication to go in both directions, but not simultaneously. Design a protocol that implements half-duplex communication between two connected entities, a and b. Prove its correctness and analyze its complexity.

Exercise 1.12.8 Half-duplex links allow communication to go in both directions, but not simultaneously. Design a protocol that implements half-duplex communication between three entities, a, b, and c, connected to each other. Prove its correctness and analyze its complexity.

1.12.2 Answers to Exercises

Answer to Exercise 1.12.1 Let us prove that every entity will indeed receive the message. The proof is by induction on the distance d of an entity from the initiator s. The result is clearly true for d = 0. Assume that it is true for all entities at distance at most d. Let x be an entity at distance d + 1 from s. Consider a shortest path s → x1 → . . . → xd → x between s and x. As xd is at distance d from s, by the induction assumption it receives the message. If xd received the message from x, then x already received the message and the proof is completed. Otherwise, xd received the message from a different neighbor (or is itself the initiator), and it then sends the message to all its other neighbors, including x. Hence x will eventually receive the message.

Answer to Exercise 1.12.2 The total number of messages sent without the improvement was $\sum_{x \in V} |N(x)| = 2|E| = 2m$; in Flooding, every entity (except the initiator) will send one message less. Hence the total number of messages is 2m − (|V| − 1) = 2m − n + 1.
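The count 2m − n + 1 can be checked on small graphs with a short simulation. The sketch below is only illustrative: the adjacency-set representation and the function name `flooding_messages` are assumptions, and messages are delivered in first-in-first-out order, which is just one of the many possible executions (the count itself does not depend on the order).

```python
from collections import deque


def flooding_messages(adj, initiator):
    """Count the messages sent by Flooding on a connected undirected graph.

    adj maps each node to the set of its neighbors.
    """
    done = {initiator}
    in_transit = deque((initiator, y) for y in adj[initiator])
    sent = len(adj[initiator])            # the initiator sends to all of N(x)
    while in_transit:
        sender, x = in_transit.popleft()
        if x in done:
            continue                      # no rule for (DONE, Receiving): nil action
        done.add(x)
        for y in adj[x] - {sender}:       # send(M) to N(x) - {sender}
            in_transit.append((x, y))
            sent += 1
    assert done == set(adj), "graph must be connected"
    return sent


ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
n, m = 4, 4
print(flooding_messages(ring, 0), 2 * m - n + 1)   # both values are 5
```

On a tree, where m = n − 1, the same function returns n − 1.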


Answer to Exercise 1.12.6 (Basis of Induction only) Consider first the case k = 1: Only one child, say z, has a dirty forehead. In this case, z will see that everyone else has a clean forehead; as the teacher has said that at least one child has a dirty forehead, z knows that he/she must be the one. Thus, when the teacher arrives, he/she comes forward. Notice that a clean child sees that z is dirty but finds out that his/her own forehead is clean only when z goes forward.

Consider now the case k = 2: There are two dirty children, a and b; a sees the dirty forehead of b and the clean one of everybody else. Clearly he/she does not know his/her own status; he/she knows that if he/she is clean, b is the only one who is dirty and will go forward when the teacher arrives. So, when the teacher comes and b does not go forward, a understands that his/her forehead is also dirty. (A similar reasoning is carried out by b.) Thus, when the teacher returns the second time, both a and b go forward.

BIBLIOGRAPHY

[1] K.M. Chandy and J. Misra. Parallel Program Design: A Foundation. Addison-Wesley, 1988.
[2] J.Y. Halpern and Y. Moses. Knowledge and common knowledge in a distributed environment. Journal of the ACM, 37(3):549–587, 1990.
[3] D.J. Lehmann. Knowledge, common knowledge and related puzzles. In 3rd ACM Symposium on Principles of Distributed Computing, pages 62–67, Vancouver, 1984.
[4] N.A. Lynch and M.R. Tuttle. Hierarchical correctness proofs of distributed algorithms. In 6th ACM Symposium on Principles of Distributed Computing (PODC), pages 137–151, Vancouver, 1987.
[5] S.J. Rosenschein. Formal theories of knowledge in AI and robotics. New Generation Computing, 3:345–357, 1985.

CHAPTER 2

Basic Problems and Protocols

The aim of this chapter is to introduce some of the basic, primitive, computational problems and solution techniques. These problems are basic in the sense that their solution is commonly (sometimes frequently) required for the functioning of the system (e.g., broadcast and wake-up); they are primitive in the sense that their computation is often a preliminary step or a module of complex computations and protocols (e.g., traversal and spanning-tree construction).

Some of these problems (e.g., broadcast and traversal), by their nature, are started by a single entity; in other words, these computational problems have, in their definition, the restriction unique initiator (UI). Other problems (e.g., wake-up and spanning-tree construction) have no such restriction. The computational differences created by the additional assumption of a single initiator can be dramatic.

In this chapter we have also included the discussion of (multiple-initiators) computations in tree networks. Their fundamental importance derives from the fact that most global problems (i.e., problems that, to be solved, require the involvement of all entities) can often be correctly, easily, and efficiently solved by designing a protocol for trees and executing it on a spanning tree of the network.

All the problems considered here require, for their solution, the Connectivity (CN) restriction (i.e., every entity must be reachable from every other entity). In general, and unless otherwise stated, we will also assume Total Reliability (TR) and Bidirectional Links (BL). These three restrictions are commonly used together, and the set R = {BL, CN, TR} will be called the set of standard restrictions.

The techniques we introduce in this chapter to solve these problems are basic ones; once properly understood, they form a powerful and essential toolset that can be effectively employed by every designer of distributed algorithms.
2.1 BROADCAST

2.1.1 The Problem

Consider a distributed computing system where only one entity, x, knows some important information; this entity would like to share this information with all the other entities in the system; see Figure 2.1. This problem is called broadcasting (Bcast), and we have already started its examination in the previous chapter.

FIGURE 2.1: Broadcasting Process.

To solve this problem means to design a set of rules that, when executed by the entities, will lead (within finite time) to a configuration where all entities will know the information; the solution must work regardless of which entity has the information at the beginning.

Built into the definition of the problem is the assumption, Unique Initiator (UI), that only one entity will start the task. Actually, this assumption is further restricted, because the unique initiator must be the one with the initial information; we shall denote this restriction by UI+.

To solve this problem, every entity must clearly be involved in the computation. Hence, for its solution, broadcasting requires the Connectivity (CN) restriction (i.e., every entity must be reachable from every other entity); otherwise some entities will never receive the information. We have seen a simple solution to this problem, Flooding, under two additional restrictions: Total Reliability (TR) and Bidirectional Links (BL). Recall that the set R = {BL, CN, TR} is the set of standard restrictions.

2.1.2 Cost of Broadcasting

As we have seen, the solution protocol Flooding uses O(m) messages and, in the worst case, O(d) ideal time units, where d is the diameter of the network.

The first and natural question is whether these costs could be reduced significantly (i.e., in order of magnitude) using a different approach or technique, and if so, by how much. This question is equivalent to asking what the complexity of the broadcasting problem is. To answer this type of question we need to establish a lower bound: to find a bound f (typically, a function of the size of the network) and to prove that the cost of every solution algorithm is at least f. In other words, a lower bound holds irrespective of the protocol, and it depends solely on the problem; hence, it is an indication of how complex the problem really is.
We will denote by M(Bcast/RI+) and T(Bcast/RI+) the message and the time complexity of broadcasting under RI+ = R ∪ UI+, respectively.

A lower bound on the amount of ideal time units required to perform a broadcast is simple to derive: Every entity must receive the information regardless of how distant it is from the initiator, and any entity could be the initiator. Hence, in the worst case,

T(Bcast/RI+) ≥ Max{d(x, y) : x, y ∈ V} = d.    (2.1)
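The quantity on the right-hand side of (2.1) is the diameter of G. As a minimal sketch, it can be computed with one breadth-first search per node; the function names `eccentricity` and `diameter` and the adjacency-set representation are assumptions made for illustration.

```python
from collections import deque


def eccentricity(adj, source):
    """Largest hop distance from source, found by breadth-first search."""
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        x = frontier.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                frontier.append(y)
    return max(dist.values())


def diameter(adj):
    """d = Max{d(x, y) : x, y in V} for a connected undirected graph."""
    return max(eccentricity(adj, x) for x in adj)


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(eccentricity(path, 1), diameter(path))   # 2 and 3
```

For a given initiator x, the information needs at least eccentricity(x) ideal time units to reach the farthest entity; the worst case over all initiators is the diameter.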


The fact that Flooding performs the broadcast in d ideal time units means that the lower bound is tight (i.e., it can be achieved) and that Flooding is time optimal. In other words, we know exactly the ideal time complexity of broadcasting:

Property 2.1.1 The ideal time complexity of broadcasting under RI+ is Θ(d).

Let us now consider the message complexity. An obvious lower bound on the number of messages is also easy to derive: in the end, every entity must know the information; thus a message must be received by each of the n − 1 entities that initially did not have the information. Hence, M(Bcast/RI+) ≥ n − 1. With a little extra effort, we can derive a more accurate lower bound:

Theorem 2.1.1 M(Bcast/RI+) ≥ m.

Proof. Assume that there exists a correct broadcasting protocol A which, in each execution, under RI+ on every G, uses fewer than m(G) messages. This means that there is at least one link in G where no message is transmitted in any direction during an execution of the algorithm. Consider an execution of the algorithm on G, and let e = (x, y) ∈ E be the link where no message is transmitted by A. Now construct a new graph G′ from G by removing the edge e and adding a new node z and two new edges e1 = (x, z) and e2 = (y, z) (see Fig. 2.2). Set z in a noninitiator status. Run exactly the same execution of A on the new graph G′: since no message was sent along (x, y), this is possible. But since no message was sent along (x, y) in the original execution, x and y never send a message to z in the current execution. As a result, z will never receive the information (i.e., change status). This contradicts the fact that A is a correct broadcasting protocol. ∎
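The graph transformation used in the proof can be made concrete with a small sketch. The helper names `subdivide` and `links` and the adjacency-set representation are hypothetical; the code only builds G′ and checks its structure, it does not run any protocol.

```python
def subdivide(adj, x, y, z="z"):
    """Return G' obtained from G by replacing link (x, y) with (x, z) and (y, z)."""
    assert y in adj[x] and z not in adj
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    new_adj[x].discard(y)
    new_adj[y].discard(x)
    new_adj[x].add(z)
    new_adj[y].add(z)
    new_adj[z] = {x, y}
    return new_adj


def links(adj):
    """Set of undirected links of the graph."""
    return {frozenset((u, v)) for u in adj for v in adj[u]}


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
g_prime = subdivide(triangle, "a", "b")
silent = frozenset(("a", "b"))
# Every link other than the silent one survives in G', so an execution that
# never used (a, b) can be replayed unchanged; z hangs only off the new links.
```

Since x and y see the same number of ports in G and G′, the replayed execution is indistinguishable to them, which is what the argument relies on.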

FIGURE 2.2: A message must be sent on each link.


This means that any broadcasting algorithm requires Ω(m) messages. Since Flooding solves broadcasting with 2m − n + 1 messages (see Exercise 2.9.1), this implies M(Bcast/RI+) ≤ 2m − n + 1. Since the upper bound and the lower bound are of the same order of magnitude, we can summarize:

Property 2.1.2 The message complexity of broadcasting under RI+ is Θ(m).

The immediate consequence is that, in order of magnitude, Flooding is a message-optimal solution. Thus, if we want to design a new protocol to improve the 2m − n + 1 cost of Flooding, the best we can hope to achieve is to reduce the constant 2; in any case, because of Theorem 2.1.1, the reduction cannot bring the constant below 1.

2.1.3 Broadcasting in Special Networks

The results we have obtained so far apply to generic solutions; that is, solutions that do not depend on G and can thus be applied regardless of the communication topology (provided it is undirected and connected). Next, we will consider performing the broadcast in special networks. Throughout we will assume the standard restrictions plus UI+.

Broadcasting in Trees

Consider the case when G is a tree; that is, G is connected and contains no cycles. In a tree, m = n − 1; hence, the use of protocol Flooding for broadcasting in a tree will cost 2m − (n − 1) = 2(n − 1) − (n − 1) = n − 1 messages.

IMPORTANT. This cost is achieved even if the entities do not know that the network is a tree.

IMPORTANT. An interesting side effect of broadcasting on a tree is that the tree becomes rooted in the initiator of the broadcast.

Broadcasting in Oriented Hypercubes

A communication topology that is commonly used as an interconnection network is the (k-dimensional) labeled hypercube, denoted by Hk. An oriented hypercube H1 of dimension k = 1 is just a pair of nodes called (in binary) “0” and “1,” connected by a link labeled “1” at both nodes. A hypercube Hk of dimension k > 1 is obtained by taking two hypercubes of dimension k − 1, $H'_{k-1}$ and $H''_{k-1}$, and connecting the nodes with the same name with a link labeled k at both nodes; the name of each node in $H'_{k-1}$ (respectively, $H''_{k-1}$) is then modified by prefixing it with the bit 0 (respectively, 1); see Figure 2.3.

FIGURE 2.3: Oriented Hypercube Networks
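The recursive definition of the oriented hypercube can be sketched directly in code. The function name `hypercube` and the dictionary representation (node name mapped to a table from port label to neighbor name) are assumptions of this sketch; the node names serve only to describe the structure.

```python
def hypercube(k):
    """Build H_k with the dimensional labeling.

    Returns a dict mapping each node name (a k-bit string) to a dict
    {port label: neighbor name}.
    """
    if k == 1:
        return {"0": {1: "1"}, "1": {1: "0"}}
    smaller = hypercube(k - 1)
    cube = {}
    for prefix in "01":                       # two copies of H_{k-1}, renamed
        for name, ports in smaller.items():
            cube[prefix + name] = {l: prefix + nbr for l, nbr in ports.items()}
    for name in smaller:                      # join same-named nodes with label k
        cube["0" + name][k] = "1" + name
        cube["1" + name][k] = "0" + name
    return cube


h3 = hypercube(3)
n = len(h3)
m = sum(len(ports) for ports in h3.values()) // 2
print(n, m)   # 8 nodes and 12 links for k = 3
```

In this representation the link labeled l always joins two names that differ exactly in the l-th bit counted from the right, and the label is the same at both endpoints.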

So, for example, node “0010” in $H'_4$ will be connected to node “0010” in $H''_4$ by a link labeled l = 5, and their names will become “00010” and “10010,” respectively.

This labeling l of the links is symmetric (i.e., lx(x, y) = ly(x, y)) and is called the dimensional labeling of a hypercube.

IMPORTANT. These names are used only for descriptive purposes; they are not known to the entities. By contrast, the labels of the links (i.e., the port numbers) are known to the entities by the Local Orientation axiom.

A hypercube of dimension k has $n = 2^k$ nodes; each node has k links, labeled 1, 2, . . . , k. Hence the total number of links is m = nk/2 = (n/2) log n = O(n log n).

A straightforward application of Flooding in a hypercube will cost 2m − (n − 1) = n log n − (n − 1) = O(n log n) messages. However, hypercubes are highly structured networks with many interesting properties. We can exploit these special properties to construct a more efficient broadcast. Obviously, if we do so, the protocol cannot be used in other networks. Consider the following simple strategy.


Strategy HyperFlood:
1. The initiator sends the message to all its neighbors.
2. A node receiving a message from the link labeled l will send the message only on those links with label l′ < l.

NOTE. The only difference between HyperFlood and the normal Flooding is in step 2: Instead of sending the message to all neighbors except the sender, the entity will forward it only to some of them, which will depend on the label of the port from where the message is received.

As we will see, this strategy correctly performs the broadcast using only n − 1 messages (instead of O(n log n)). Let us first examine termination and correctness. Let Hk(x) denote the subgraph of Hk induced by the links where messages are sent by HyperFlood when x is the initiator. Clearly every node in Hk(x) will receive the information.

Lemma 2.1.1 HyperFlood correctly terminates.

Proof. Let x be the initiator; starting from x, the messages are sent only on links with decreasing labels; for example, if y receives the message from link 4, it will forward it only to the ports 1, 2, and 3. To prove that every entity will receive the information sent by x, we need to show that, for every node y, there is a path from x to y such that the sequence of the labels on the path from x to y is decreasing. (Note that the labels on the path do not need to be consecutive integers.) To do so we will use the following property of hypercubes.

Property 2.1.3 In a k-dimensional hypercube Hk, any node x is connected to any other node y by a path π ∈ Π[x, y] such that Λ(π) is a decreasing sequence.

Proof. Consider the k-bit names of x and of y in Hk: xk, xk−1, . . . , x1 and yk, yk−1, . . . , y1. If x ≠ y, these two strings will differ in t ≥ 1 positions. Let j1, j2, . . . , jt be these positions in decreasing order; that is, ji > ji+1. Consider now the nodes v0, v1, v2, . . . , vt, where v0 = x, and the name of vi differs from the name of vi+1 only in the ji+1-th position. Thus, there is a link labeled ji+1 connecting vi to vi+1, and clearly vt = y. But this means that v0, v1, v2, . . . , vt is a path from x to y, and the sequence of labels on this path is j1, j2, . . . , jt, which is decreasing. ∎

Thus, Hk(x) is connected and spans (i.e., it contains all the nodes of) Hk, regardless of x. In other words, within finite time, every entity will have the information. ∎

Let us now concentrate on the cost of HyperFlood. First of all observe that

M[HyperFlood/Hk] = n − 1.    (2.2)


To prove that only n − 1 messages will be sent during the broadcast, we just need to show that every entity will receive the information only once. This is true because, for every x, Hk(x) contains no cycles (see Exercise 2.9.9).

The proof that, for every x, the eccentricity of x in Hk(x) is k is also left as an exercise (see Exercise 2.9.10); this implies that the ideal time delay of HyperFlood in Hk is always k. That is,

T[HyperFlood/Hk] = k.    (2.3)
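Both costs can be observed on small hypercubes with a synchronous simulation of HyperFlood. The sketch below makes several assumptions for convenience: nodes are encoded as the integers 0 to 2^k − 1, the link labeled l at node x leads to x XOR 2^(l−1), and every message takes exactly one round. The function name `hyperflood` is hypothetical.

```python
def hyperflood(k, initiator=0):
    """Simulate HyperFlood on H_k in synchronous rounds (one ideal time unit each).

    Returns (messages sent, number of rounds, receptions per node).
    """
    n = 1 << k
    receptions = [0] * n
    # the initiator sends on all its links: (destination, label of arrival link)
    in_transit = [(initiator ^ (1 << (l - 1)), l) for l in range(1, k + 1)]
    messages, rounds = len(in_transit), 0
    while in_transit:
        rounds += 1
        next_wave = []
        for node, label in in_transit:
            receptions[node] += 1
            # forward only on links with a smaller label than the arrival link
            for l in range(1, label):
                next_wave.append((node ^ (1 << (l - 1)), l))
        messages += len(next_wave)
        in_transit = next_wave
    return messages, rounds, receptions


msgs, rounds, receptions = hyperflood(4)
print(msgs, rounds)   # 15 messages and 4 rounds for k = 4 (n = 16)
```

The simulation forwards on every reception without checking whether a node was already informed, so a duplicate delivery would show up as a reception count above 1; none occurs, which agrees with Hk(x) having no cycles.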

These costs are the best that any broadcast algorithm can achieve in a hypercube, regardless of how much more knowledge the entities have. However, they are obtained here under the additional restriction that the network is a k-dimensional hypercube with a dimensional labeling; that is, under H = {(G, l) = Hk}. Summarizing, we have

Property 2.1.4 The ideal time complexity of broadcasting in a k-dimensional hypercube with a dimensional labeling under RI+ is Θ(k).

Property 2.1.5 The message complexity of broadcasting in a k-dimensional hypercube with a dimensional labeling under RI+ is Θ(n).

IMPORTANT. The reason why we are able to “bypass” the Ω(m) lower bound expressed by Theorem 2.1.1 is that we are restricting the applicability of the protocol.

Broadcasting in Complete Graphs

Among all network topologies, the complete graph is the one with the most links: Every entity is connected to all others; thus $m = n(n-1)/2 = O(n^2)$ (recall we are considering bidirectional links), and d = 1. The use of a generic protocol will require $O(n^2)$ messages. But this is really unnecessary. Broadcasting in a complete graph is easily accomplished: Because everybody is connected to everybody else, the initiator just needs to send the information to its neighbors (i.e., execute the command “send(I) to N(x)”) and the broadcast is completed. This uses only n − 1 messages and d = 1 ideal time units. Clearly this protocol, KBcast, works only in a complete graph, that is, under the additional restriction K ≡ “G is a complete graph.” Summarizing:

Property 2.1.6 The message and the ideal time complexities of broadcasting in a complete graph under RI+ are M(Bcast/RI+; K) = n − 1 and T(Bcast/RI+; K) = 1, respectively.


FIGURE 2.4: Wake-Up Process.

2.2 WAKE-UP

2.2.1 Generic Wake-Up

Very often, in a distributed environment, we are faced with the following situation: A task must be performed in which all the entities must be involved; however, only some of them are independently active (because of a spontaneous event, or having finished a previous computation) and ready to compute; the others are inactive, not even aware of the computation that must take place. In these situations, to perform the task, we must ensure that all the entities become active. Clearly, this preliminary step can only be started by the entities that are active already; however, they do not know which other entities (if any) are already active.

This problem is called Wake-up (Wake-Up): An active entity is usually called awake, an inactive (still) one is called asleep; the task is to wake all entities up; see Figure 2.4.

It is not difficult to see the relationship between broadcasting and wake-up: Broadcast is a wake-up with only one initially awake entity; conversely, wake-up is a broadcast with possibly many initiators (i.e., initially more than one entity has the information). In other words, broadcast is just a special case of the wake-up problem.

Interestingly, but not surprisingly, the flooding strategy used for broadcasting actually solves the more general Wake-Up problem. The modified protocol, called WFlood, is described in Figure 2.5. Initially all entities are asleep; any asleep entity can become spontaneously awake and start the protocol. It is not difficult to verify that the protocol correctly terminates under the standard restrictions (Exercise 2.9.7).

Let us concentrate on the cost of protocol WFlood. The number of messages is at least equal to that of broadcast; actually, it is not much more (see Exercise 2.9.6):

2m ≥ M[WFlood] ≥ 2m − n + 1.    (2.4)

As broadcast is a special case of wake-up, not much improvement is possible (except perhaps in the size of the constant):

M(Wake-Up/R) ≥ M(Bcast/RI+) = Ω(m)

The ideal time will, in general, be smaller than the one for broadcast:

T(Bcast/RI+) ≥ T(Wake-Up/R)

PROTOCOL WFlood.

Status Values: S = {ASLEEP, AWAKE}; S_INIT = {ASLEEP}; S_TERM = {AWAKE}.

Restrictions: R.

ASLEEP
    Spontaneously
    begin
        send(W) to N(x);
        become AWAKE;
    end

    Receiving(W)
    begin
        send(W) to N(x) − {sender};
        become AWAKE;
    end

FIGURE 2.5: Wake-Up by Flooding
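
The behavior of WFlood can be explored with a small simulation. The following is only a sketch under stated assumptions: the function name `wflood`, the adjacency-dictionary representation, and the FIFO delivery order are illustrative choices, not part of the protocol; the network is assumed connected, and all initiators are assumed to wake up before any message is delivered.

```python
from collections import deque

def wflood(adj, initiators):
    """Simulate WFlood and return the number of W messages sent.

    adj: dict mapping each entity to the set of its neighbors
         (undirected, connected graph; an assumption of this sketch).
    initiators: distinct entities that wake up spontaneously
         before any message is delivered.
    """
    awake = set()
    in_transit = deque()  # (sender, receiver) pairs, delivered in FIFO order
    sent = 0

    def wake(x, exclude=None):
        nonlocal sent
        awake.add(x)
        for y in adj[x]:
            if y != exclude:
                in_transit.append((x, y))
                sent += 1

    for x in initiators:            # ASLEEP, Spontaneously: send(W) to N(x)
        wake(x)
    while in_transit:
        sender, x = in_transit.popleft()
        if x not in awake:          # ASLEEP, Receiving(W): send to N(x) - {sender}
            wake(x, exclude=sender)
        # an AWAKE entity ignores further W messages
    assert awake == set(adj)        # every entity has been woken up
    return sent
```

In this schedule every initiator sends on all its links and every other entity on all but one, so the count is 2m − n + k for k initiators; this lies between the two bounds of Expression 2.4 and, in a tree (m = n − 1), equals n + k − 2.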

However, in the case of a single initiator, the two cases coincide. As upper and lower bounds coincide in order of magnitude, we can conclude that protocol WFlood is both message optimal and, in the worst case, time optimal. The complexity of Wake-Up is summarized by the following two properties.

Property 2.2.1 The message complexity of Wake-up under R is Θ(m).

Property 2.2.2 The worst case ideal time complexity of Wake-up under R is Θ(d).

2.2.2 Wake-Up in Special Networks

Trees The cost of using protocol WFlood for wake-up will depend on the number of initiators. In fact, if there is only one initiator, then this is just a broadcast and costs only n − 1 messages. By contrast, if every entity starts independently, there will be a total of 2(n − 1) messages. Let k denote the number of initiators; note that this number is not a system parameter like n or m; it is, however, bounded by a system parameter: k ≤ n. Then the total number of messages when executing WFlood in a tree will be exactly M[WFlood/Tree] = n + k − 2.

(2.5)

Labeled Hypercubes In Section 2.1, by exploiting the properties of the hypercube and of the dimensional labeling, we have been able to construct a broadcast protocol, which uses only O(n) messages, instead of the Ω(n log n) messages required by any generic protocol.

Let us see if we can achieve a similar result also for the wake-up. In other words, can we exploit the properties of a labeled hypercube to do better than generic protocols? The answer is, unfortunately, NO.

Lemma 2.2.1 M(Wake-Up/R; H) = Ω(n log n).

As a consequence, we might as well employ the generic protocol WFlood, which uses O(n log n) messages. Summarizing,

Property 2.2.3 The message complexity of wake-up under R in a k-dimensional hypercube with a dimensional labeling is Θ(n log n).

Complete Graphs Let us focus on wake-up in a complete graph. The use of the generic protocol WFlood will require O(n^2) messages. We can obviously use the simplified broadcast protocol KBcast we developed for complete graphs. The number of messages transmitted will be k(n − 1), where k denotes the number of initiators. In the worst case (when every entity is independently awake and they all simultaneously start the protocol), however, O(n^2) messages will still be transmitted. Let us see if, by exploiting the properties of complete graphs, we can construct a wake-up protocol that uses only O(n) messages, instead of the O(n^2) we have achieved so far. (After all, we have been able to do it in the case of the broadcast problem.) Surprisingly, also in this case, the answer is NO.

Lemma 2.2.2 M(Wake-Up/R; K) = Ω(n^2).

This implies that the use of WFlood for wake-up is a message-optimal solution. In other words,

Property 2.2.4 The message complexity of wake-up under R in a complete network is Θ(n^2).

Complete Graphs with ID To reduce the number of messages, a more restricted environment is required; that is, we need to make additional assumptions. For example, if we add the restriction that the entities have unique names (restriction Initial Distinct values (ID)), then there are protocols capable of performing wake-up with O(n log n) messages in a complete graph; they are not simple and actually solve a much more complex problem, Election, which we will discuss at length in Chapter 3. Strangely, nothing better than that can be accomplished. In fact, let IR + K = R ∪ K; then the worst case message complexity of wake-up in a complete graph under the standard restrictions R plus ID is as follows:

Property 2.2.5 M(Wake-Up/R; ID; K) ≥ 0.5 n log n.

To see why this is true, we will construct a “bad” but possible case, which any protocol can encounter, and show that, in such a case, Ω(n log n) messages will be exchanged. The lower bound will hold even if there is message ordering. For simplicity of discussion and calculation, we will assume that n is a power of 2; the results hold also if this is not the case.

To construct the “bad” case for an (arbitrary) solution protocol A, we will consider a game between the entities on one side and an adversary on the other: the entities obey the rules of the protocol; the adversary will try to make the worst possible scenario occur, so as to force the use of as many messages as possible. The adversary has the following four powers:

1. it decides the initial values of the entities (they must be distinct);
2. it decides which entities spontaneously start the execution of A, and when;
3. it decides when a transmitted message arrives (it must be within finite time); and
4. importantly, it decides the matching between links and labels: Let e1, e2, ..., ek be the links incident on x, and let l1, l2, ..., lk be the port labels to be used by x for those links; during the execution, when x performs a “send to l” command, and l has not been assigned yet, the adversary will choose which of the unused links (i.e., those on which no message has been sent or received) the label l will be assigned to.

NOTE. Sending a message to more than one port will be treated as sending the message to each of those ports one at a time (in an arbitrary order).

Whatever the adversary decides, it can happen in a real execution. Let us see how bad a case the adversary can create for A. Two sets of entities will be said to be connected at a time t if at least one message has been transmitted from an entity of one set to an entity of the other.

Adversary’s Strategy.

1. Initially, the adversary will wake up only one entity s, which we will call the seed, and which will start the execution of the protocol. When s decides to send a message to port number l, the adversary will wake up another entity y and assign label l to the edge from s to y. It will then delay the transmission on that link until y also decides to send a message to some port number l′; the adversary will then assign label l′ to the link from y to s and let the two messages arrive at their destinations simultaneously. In this way, each message will reach an awake node, and the two entities are connected. From now on, the adversary will act in a similar way: it will always ensure that messages are sent to already-awake nodes, and that the set of awake nodes is connected.

2. Consider an entity x executing a send operation to an unassigned label a.
   (a) If x has an unused link (i.e., a link on which no messages have been sent so far) connecting it to an awake node, the adversary will assign a to that link. In other words, the adversary will always try to make the awake entities send messages to other awake entities.
   (b) If all links between x and the awake nodes have been used, then the adversary will create another set of awake nodes and connect the two sets.
       i. Let x0, ..., xk−1 be the currently awake nodes, ordered according to their wake-up time (thus, x0 = s is the seed, and x1 = y). The adversary will perform the following function: choose k inactive nodes z0, ..., zk−1; establish a logical correspondence between xj and zj; assign initial values to the new entities so that the order among them is the same as the one among the values of the corresponding entities; wake up these entities and force them to have the “same” execution (same scheduling and same delays) as the corresponding entities already had. (So, z0 will be woken up first, its first message will be sent to z1, which will be woken up next and will send a message to z0, and so forth.)
       ii. The adversary will then assign label a to the link connecting x to its corresponding entity z in the new set; the message will be held in transit until z (as x did) needs to transmit a message on an unused link (say, with label b) but all the edges connecting it to its set of awake entities have already been used.
       iii. When this happens, the adversary will assign the label b to the link from z to x and make the two messages between x and z arrive and be processed.

Let us summarize the strategy of the adversary: The adversary tries to force the protocol to send messages only to already-awake entities and awakens new entities only when it cannot do otherwise; the newly awake entities are equal in number to the already awake entities; and they are forced by the adversary to have the same execution among them as the other entities did, before any communication takes place between the two sets. When this happens, we will say that the adversary has started a new stage.

Let us now examine the situations created by the adversary with this strategy and analyze the cost of the protocol in the corresponding executions. Let Active(i) denote the awake entities in stage i and New(i) = Active(i) − Active(i − 1) the entities that the adversary woke up in this stage; initially, Active(0) is just the seed. The newly awake entities are equal in number to the already awake entities; that is, |New(i)| = |Active(i − 1)|. Let µ(i − 1) denote the total number of messages that have been exchanged before the activation of the new entities. The adversary forces the new entities to have the same execution as the entities in Active(i − 1) had, thus exchanging µ(i − 1) messages, before allowing the two sets to become connected. Thus, the total number of messages until the communication between the two sets takes place is 2µ(i − 1).

Once the communication takes place, how many messages (including those two) are transmitted before the next stage? The exact answer will depend on the protocol A, but regardless of which protocol we are using, the adversary will not start a new stage i + 1 unless it is forced to; this will happen only if an entity x issues a “send to l” command (where l is an unassigned label) and all the links connecting x to the other awake entities have already been used. This means that x must have either sent to or received from all the entities in Active(i) = Active(i − 1) ∪ New(i). Assume that x ∈ Active(i − 1); then, of all these messages, the ones between x and New(i) can only have occurred in stage i (since those entities were not active before); this means that at least |New(i)| = |Active(i − 1)| additional messages are sent before stage i + 1. If instead x ∈ New(i), these messages have all been transmitted in this stage (as x was not awake before); in other words, even in this case, |New(i)| = |Active(i − 1)| additional messages are sent before stage i + 1.

Summarizing, the total cost µ(i − 1) before stage i is doubled, and at least |Active(i − 1)| additional messages are sent before stage i + 1. In other words, µ(i) ≥ 2µ(i − 1) + |Active(i − 1)|. As the awake entities double in each stage, and initially only the seed is active, |Active(i)| = 2^i. Hence, observing that µ(0) = 0, µ(i) ≥ 2µ(i − 1) + 2^(i−1) ≥ i 2^(i−1). The total number of stages is exactly log n, as the awake entities double every stage. Hence, with this strategy, the adversary can force any protocol to transmit at least µ(log n) messages. As µ(log n) ≥ 0.5 n log n, it follows that any wake-up protocol will transmit Ω(n log n) messages in the worst case, even if the entities have distinct identifiers (ids). More efficient wake-up protocols can be derived if we have in our system a “good” labeling of the links instead.
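
The closed form used in the argument above can be checked numerically. This is a minimal sketch, assuming the recurrence is taken with equality (the least cost the adversary is guaranteed to force); the function name `mu_lower_bound` is hypothetical.

```python
def mu_lower_bound(stages):
    """Unroll mu(i) = 2*mu(i-1) + |Active(i-1)| with mu(0) = 0, |Active(0)| = 1."""
    mu, active = 0, 1
    for _ in range(stages):
        mu = 2 * mu + active   # old cost is duplicated, plus |Active(i-1)| new messages
        active *= 2            # the awake entities double at every stage
    return mu

# The unrolled value coincides with i * 2^(i-1); with n = 2^i this is 0.5 * n * log n.
for i in range(1, 11):
    assert mu_lower_bound(i) == i * 2 ** (i - 1)
```

For instance, with n = 1024 entities (10 stages) the forced cost is at least 5120 = 0.5 · 1024 · 10 messages.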

2.3 TRAVERSAL

Traversal of the network allows every entity in the network to be “visited” sequentially (one after the other). Its main uses are in the control and management of a shared resource and in sequential search processes. In abstract terms, the traversal problem starts with an initial configuration where all entities are in the same state (say unvisited) except the one that is visited and is the sole initiator; the goal is to render all the entities visited but sequentially (i.e., one at a time). A traversal protocol is a distributed algorithm that, starting from the single initiator, allows a special message called “traversal token” (or simply, token) to reach every entity sequentially (i.e., one at a time). Once a node is reached by the token, it is marked as “visited.” Depending on the traversal strategy employed, we will have different traversal protocols.

2.3.1 Depth-First Traversal

A well known strategy is the depth-first traversal of a graph. According to this strategy, the graph is visited (i.e., the token is forwarded) trying to go forward as long as possible; if it is forwarded to an already visited node, it is sent back to the sender, and that link is marked as a back-edge; if the token can no longer be forwarded (it is at a node where all its neighbors have been visited), the algorithm will “backtrack” until it finds an unvisited node to which the token can be forwarded. The distributed implementation of depth-first traversal is straightforward.

1. When first visited, an entity remembers who sent the token, creates a list of all its still unvisited neighbors, forwards the token to one of them (removing it from the list), and waits for its reply returning the token.
2. When the neighbor receives the token, it will return the token immediately if it had been visited already by somebody else, notifying that the link is a back-edge; otherwise, it will first forward the token to each of its unvisited neighbors sequentially, and then reply returning the token.
3. Upon the reception of the reply, the entity forwards the token to another unvisited neighbor.
4. Should there be no more unvisited neighbors, the entity can no longer forward the token; it will then send the reply, returning the token to the node from which it first received it.

NOTE. When the neighbor in step (2) determines that a link is a back-edge, it knows that the sender of the token is already visited; thus, it will remove it from the list of unvisited neighbors.

We will use three types of messages: “T” to forward the token in the traversal, “Backedge” to notify the detection of a back-edge, and “Return” to return the token upon local termination.

Protocol DF Traversal is shown in Figure 2.6, where the operation of extracting an element from a set B and assigning it to variable a is denoted by a ⇐ B. Let us examine its costs. Focus on a link (x, y) ∈ E. What messages can be sent on it? Suppose x sends T to y; then y will only send to x either Return (if it was idle when the T arrived) or Backedge (otherwise). In other words, on each link there will be exactly two messages transmitted. Since the traversal is sequential, T[DF Traversal] = M[DF Traversal]; hence T[DF Traversal] = M[DF Traversal] = 2m.

(2.6)

PROTOCOL DF Traversal.

Status: S = {INITIATOR, IDLE, VISITED, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.

Restrictions: R; UI.

INITIATOR
    Spontaneously
    begin
        Unvisited := N(x);
        initiator := true;
        VISIT;
    end

IDLE
    Receiving(T)
    begin
        entry := sender;
        Unvisited := N(x) − {sender};
        initiator := false;
        VISIT;
    end

VISITED
    Receiving(T)
    begin
        Unvisited := Unvisited − {sender};
        send(Backedge) to {sender};
    end

    Receiving(Return)
    begin
        VISIT;
    end

    Receiving(Backedge)
    begin
        VISIT;
    end

Procedure VISIT
begin
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
        become VISITED
    else
        if not(initiator) then
            send(Return) to entry;
        endif
        become DONE;
    endif
end

FIGURE 2.6: DF Traversal
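
The protocol of Figure 2.6 can be traced with a short simulation. The sketch below is illustrative only: the name `df_traversal` and the adjacency-dictionary representation are assumptions, and the code relies on the fact that the protocol is sequential, so a single variable suffices to hold the one message in transit.

```python
def df_traversal(adj, initiator):
    """Simulate DF Traversal; return (messages sent, order of first visits).

    adj: dict mapping each entity to the set of its neighbors
         (undirected, connected graph; an assumption of this sketch).
    """
    status = {x: "IDLE" for x in adj}
    unvisited, entry = {}, {}
    order, sent = [], 0

    def visit(x):
        """Procedure VISIT: returns the message sent by x, or None."""
        nonlocal sent
        if unvisited[x]:
            nxt = unvisited[x].pop()        # next <= Unvisited
            status[x] = "VISITED"
            sent += 1
            return ("T", x, nxt)
        status[x] = "DONE"
        if x != initiator:
            sent += 1
            return ("Return", x, entry[x])
        return None                          # the initiator is done: traversal over

    unvisited[initiator] = set(adj[initiator])
    order.append(initiator)
    msg = visit(initiator)
    while msg is not None:
        kind, sender, x = msg
        if kind == "T" and status[x] == "IDLE":      # first visit
            entry[x] = sender
            unvisited[x] = set(adj[x]) - {sender}
            order.append(x)
            msg = visit(x)
        elif kind == "T":                            # already visited: back-edge
            unvisited[x].discard(sender)
            sent += 1
            msg = ("Backedge", x, sender)
        else:                                        # Return or Backedge received
            msg = visit(x)
    return sent, order
```

Whatever neighbor is extracted at each step, exactly two messages cross every link, so the returned count is 2m; since only one message is ever in transit, the same number is the ideal time.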

To determine how efficient the protocol is, we first determine the complexity of the problem. Using exactly the same technique we employed in the proof of Theorem 2.1.1, we have (Exercise 2.9.11): Theorem 2.3.1 M(DFT/R) ≥ m.

Therefore, the 2m message cost of protocol DF Traversal is indeed excellent, and the protocol is message optimal.

Property 2.3.1 The message complexity of depth-first traversal under R is Θ(m).

The time requirements of a depth-first traversal are quite different from those of a broadcast. In fact, since each node must be visited sequentially, starting from the sole initiator, the time complexity is at least the number of nodes:

Theorem 2.3.2 T(DFT/R) ≥ n − 1.

The time complexity of protocol DF Traversal is dreadful. In fact, the upper bound 2m could be several orders of magnitude larger than the lower bound n − 1. For example, in a complete graph, 2m = n^2 − n. Some significant improvements in the time complexity can, however, be made by going into a finer granularity. We will discuss this topic in greater detail next.

2.3.2 Hacking ()

Let us examine protocol DF Traversal to see if it can be improved, especially its time cost.

IMPORTANT. When measuring ideal time, we consider only synchronous executions; however, when measuring messages and establishing correctness we must consider every possible schedule of events, especially the nonsynchronous executions.

Basic Hacking The protocol we have constructed is totally sequential: in a synchronous execution, at each time unit only one message will be sent, and every message requires one unit of time. So, to improve the time complexity, we need to (1) reduce the number of messages and/or (2) introduce some concurrency. By definition of traversal, each entity must receive the token (message T) at least once. In the execution of our protocol, however, some entities receive it more than once; the links from which these other T messages arrive are precisely the back-edges.

Question. Can we avoid sending T messages on back-edges? To answer this question we must understand why T messages are sent on back-edges.
When an entity x sends a T message to y, it does not know whether the link is a back-edge or not; that is, whether y has already been visited by somebody else or not. If x knew which of its neighbors are already visited, it would not send a T message to them, there would be no need for Backedge messages from them, and we would be saving messages and time. Let us examine how to achieve such a condition.

Suppose that, whenever a node is visited (i.e., it receives T) for the ﬁrst time, it notiﬁes all its (other) neighbors of this event (e.g., sending a “Visited” message) and waits for an acknowledgment (e.g., receiving an “Ack” message) from them before forwarding the token. The consequence of such a simple act is that now an entity ready to forward the token (i.e., to send a T message) really knows which of its neighbors have already been visited. This is exactly what we wanted. The price we have to pay is the transmission of the Visited and Ack messages. Notice that now an idle entity (that is an entity that has not yet been involved in the traversal) might receive a Visited message as its ﬁrst message. In the revised protocol, we will make such an entity enter a new status, available. Let us examine the effects of this change on the overall time cost of the protocol; call DF+ the resulting protocol. The time is really determined by the number of sequential messages. There are four types of messages that are sent: T, Return, Visited, and Ack. Each entity (except the initiator) will receive only one T message and send only one Return message; the initiator does not receive any T message and does not send any Return; thus, in total there will be 2(n − 1) such messages. Since all these communications occur sequentially (i.e., without any overlap), the time taken by sending the T and Return messages will be 2(n − 1). To determine how many ideal time units are added by the transmission of Visited and Ack messages, consider an entity: its transmission of all the Visited messages takes only a single time unit, since they are sent concurrently; the corresponding Ack messages will also be sent concurrently, adding an additional time unit. Since every node will do it, the sending of the Visited messages and receiving the Ack messages will increase the ideal time of the original algorithm by exactly 2n. This will give us a time cost of T[DF+] = 4n − 2.

(2.7)

It is also easy to compute how many messages this will cost. As mentioned above, there is a total of 2(n − 1) T and Return messages. In addition, each entity (except the initiator) sends a Visited message to all its neighbors except the one from which it received the token; the initiator will send it to all its neighbors. Thus, denoting by s the initiator, the total number of Visited messages is |N(s)| + Σ_{x≠s} (|N(x)| − 1) = 2m − (n − 1). Because for each Visited message there will be an Ack, the total message cost will be M[DF+] = 4m − 2(n − 1) + 2(n − 1) = 4m.

(2.8)
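
The two cost expressions for DF+ can be recomputed directly from the node degrees. The sketch below does not simulate the protocol; it only evaluates the counting argument given in the text, and the function name `df_plus_costs` is hypothetical.

```python
def df_plus_costs(adj, s):
    """Evaluate the DF+ counting argument for graph adj and initiator s.

    Returns (total messages, ideal time) as derived in the text.
    """
    n = len(adj)
    token_msgs = 2 * (n - 1)                  # one T and one Return per non-initiator
    # Visited goes to all neighbors, except the entry link for non-initiators.
    visited_msgs = len(adj[s]) + sum(len(adj[x]) - 1 for x in adj if x != s)
    ack_msgs = visited_msgs                   # one Ack per Visited
    messages = token_msgs + visited_msgs + ack_msgs
    ideal_time = token_msgs + 2 * n           # each node adds one Visited and one Ack round
    return messages, ideal_time
```

On the complete graph with n = 4 (m = 6), this yields 24 = 4m messages and 14 = 4n − 2 time units, in agreement with Expressions 2.7 and 2.8.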

Summarizing, we have been able to reduce the time cost from O(m) to O(n), which, because of Theorem 2.3.2, is optimal. The price has been the doubling of the number of messages.

Property 2.3.2 The ideal time complexity of depth-first traversal under R is Θ(n).

Advanced Hacking Let us see if the number of messages can be decreased without significantly increasing the time costs.

Question. Can we avoid sending the Ack messages? To answer this question we must understand what would happen if we do not send Ack messages. Consider an entity x that sends Visited to its neighbors; if we no longer use Ack, x will proceed immediately with forwarding the token. Assume that, after some time, the token arrives, for the first time, at a neighbor z of x (see Fig. 2.7); it is possible that the Visited message sent by x to z has not arrived yet (due to communication delays). In this case, z would not know that x has already been visited and would send the T message to it. That is, we will again send a T message on a back-edge, undoing what we had accomplished with the previous change to the protocol. But the algorithm now is rather different (we are using Visited messages, no longer Backedge messages) and this situation might not happen all the time. Still, if it happens, z will eventually receive the Visited message from x (recall we are operating under total reliability); z can then understand its mistake, pretend nothing happened (just the waste of a T message), and continue as if the T message had never been sent. On its side, x, upon receiving the token, will also understand that z made a mistake and ignore the message; x also realizes (if it did not know already) that z is visited and will remove it from its list of unvisited neighbors. Although the correctness will not be affected (Exercise 2.9.15), mistakes cost additional messages.

FIGURE 2.7: Slow Visited message: z does not know that x has been visited.

Let us examine what is really the cost of this modified protocol, which we shall call DF++. As before, the “correct” T and Return yield a total of 2n − 2 messages, and the Visited messages are 2m − n + 1 in total. Then there are the “mistakes”; each mistake costs one message. The number of mistakes can be very large. In fact, unfriendly time delays can force mistakes to occur on every back-edge; on some back-edges, there can be two mistakes, one in each direction (Exercise 2.9.16). In other words, there will be at most 2(m − n + 1) incorrect T messages. Summing up, this yields M[DF++] ≤ 4m − n + 1.

(2.9)

Let us consider now the time. We have an improvement in that the Ack messages are no longer sent, saving n time units. As there are no more Ack to wait for, an entity can forward the token at the same time as the transmission of the Visited messages; if it does not have any unvisited neighbor to send the T to, the entity will send the Return at the same time as the Visited. Hence, the sending of the Visited is done in overlap with the sending of either a T or a Return message, saving another n time units. In other words, without considering the mistakes, the total time will be 2n − 2. Let us now also consider the mistakes and evaluate the ideal time of the protocol. Strange as it might sound, when we attempt to measure the ideal execution time of this protocol, in the execution no mistakes will ever occur. This is because mistakes can only occur owing to arbitrarily long communication delays; on the contrary, ideal time is only measured under unitary delays. But under unitary delays there are no mistakes. Therefore, T[DF++] = 2n − 2.

(2.10)

IMPORTANT. It is crucial to understand this inherent limit of the cost measure we call ideal time. Unlike the number of messages, ideal time is not a “neutral” measure; it influences (thus limiting) the nature of what we want to measure. In other words, it should be treated and handled with caution. Even greater caution should be employed in interpreting the results it gives.

Extreme Hacking As we are on a roll, let us observe that we could actually use the T message as an implicit Visited, saving some additional messages. This saving will happen at every entity except those that, when they are reached for the first time by a T message, do not have any unvisited neighbor. Let f denote the number of these nodes; thus the number of Visited messages we save is n − f. Hence, the total number of messages is 4m − n + 1 − n + f. Summarizing, the cost of the optimized protocol, called DF* and described in Figures 2.8 and 2.9, is as follows: T[DF*] = 2n − 2.

(2.11)

M[DF*] = 4m − 2n + f + 1.

(2.12)

PROTOCOL DF*

Status: S = {INITIATOR, IDLE, AVAILABLE, VISITED, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.

Restrictions: R; UI.

INITIATOR
    Spontaneously
    begin
        initiator := true;
        Unvisited := N(x);
        next ⇐ Unvisited;
        send(T) to next;
        send(Visited) to N(x) − {next};
        become VISITED
    end

IDLE
    Receiving(T)
    begin
        Unvisited := N(x);
        FIRST-VISIT;
    end

    Receiving(Visited)
    begin
        Unvisited := N(x) − {sender};
        become AVAILABLE
    end

AVAILABLE
    Receiving(T)
        FIRST-VISIT;

    Receiving(Visited)
    begin
        Unvisited := Unvisited − {sender};
    end

VISITED
    Receiving(Visited)
    begin
        Unvisited := Unvisited − {sender};
        if next = sender then VISIT; endif
    end

    Receiving(T)
    begin
        Unvisited := Unvisited − {sender};
        if next = sender then VISIT; endif
    end

    Receiving(Return)
    begin
        VISIT;
    end

FIGURE 2.8: Protocol DF*

Procedure FIRST-VISIT
begin
    initiator := false;
    entry := sender;
    Unvisited := Unvisited − {sender};
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
        send(Visited) to N(x) − {entry, next};
        become VISITED;
    else
        send(Return) to {entry};
        send(Visited) to N(x) − {entry};
        become DONE;
    endif
end

Procedure VISIT
begin
    if Unvisited ≠ ∅ then
        next ⇐ Unvisited;
        send(T) to next;
    else
        if not(initiator) then
            send(Return) to entry;
        endif
        become DONE;
    endif
end

FIGURE 2.9: Routines used by Protocol DF*

IMPORTANT. The value of f, unlike n and m, is not a system parameter. In fact, it is execution-dependent: its value may change at each execution. We shall indicate this fact (for f as well as for any other execution-dependent value) by the use of the subscript .

2.3.3 Traversal in Special Networks Trees In a tree network, depth-ﬁrst traversal is particularly efﬁcient in terms of messages, and there is no need of any optimization effort (hacking). In fact, in any execution of DF Traversal in a tree, no Backedge messages will be sent (Exercise 2.9.12). Hence, the total number of messages will be exactly 2(n − 1). The time complexity is the same as the optimized version of the protocol: 2(n − 1). M[DF Traversal/Tree] = T[DF Traversal/Tree] = 2n − 2

(2.13)

An interesting side effect of a depth-ﬁrst traversal of a tree is that it constructs a virtual ring on the tree (Figure 2.10). In this ring some nodes appear more than once; in fact the ring has size 2n − 2 (Exercise 2.9.13). This fact will have useful consequences.
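
The virtual ring can be made concrete by listing, in order, the entity holding the token before each message of a depth-first traversal of a tree. This is a sketch under assumptions: the function name `virtual_ring`, the choice of visiting neighbors in sorted order, and the recursive formulation are illustrative and not part of the protocol.

```python
def virtual_ring(tree, root):
    """Return the cyclic sequence of token holders in a depth-first
    traversal of a tree, one entry per message transmitted.

    tree: dict mapping each node to the set of its neighbors (a tree).
    """
    ring = []

    def walk(x, parent):
        for y in sorted(tree[x]):
            if y != parent:
                ring.append(x)   # x holds the token and sends T to y
                walk(y, x)
                ring.append(y)   # y holds the token and sends Return to x
    walk(root, None)
    return ring
```

Every tree link is crossed once in each direction, so the sequence has 2(n − 1) = 2n − 2 entries; consecutive entries (cyclically) are always neighbors in the tree, and a node appears once for every link incident on it.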

FIGURE 2.10: Virtual ring created by DF Traversal.

Rings In a ring network, every node has exactly two neighbors. Depth-first traversal in a ring can be achieved in a simple way: the initiator chooses one direction and the token is just forwarded along that direction; once the token reaches the initiator, the traversal is completed. In other words, each entity will send and receive a single T message. Hence both the time and the message costs are exactly n. Clearly this protocol can be used only in rings.

Complete Graph In a complete graph, execution of DF* will require O(n^2) messages. Exploiting the knowledge of being in a complete network, a better protocol can be derived: the initiator will sequentially send the token to all its neighbors (which are the other entities in the network); each of these entities will return the token to the initiator without forwarding it to anybody else. The total number of messages is 2(n − 1), and so is the time.

2.3.4 Considerations on Traversal

Traversal as Access Permission The main use of a traversal protocol is in the control and management of shared resources. For example, access to a shared transmission medium (e.g., bus) must be controlled to avoid collisions (simultaneous frame transmission by two or more entities). A typical mechanism to achieve this is the use of a control (or permission) token. This token is passed from one entity to another according to the same set of rules. An entity can only transmit a frame when it is in possession of the token; once the frame has been transmitted, the token is passed to another entity. A traversal protocol by definition “passes” the token sequentially through all the entities and thus solves the access control problem. The only proviso is that, for the access permission problem, it must be made continuous: once a traversal is terminated, another must be started by the initiator.

The access permission problem is part of a family of problems commonly called Mutual Exclusion, which will be discussed in detail later in the book.

Traversal as Broadcast It is not difficult to see that any traversal protocol solves the broadcast problem: the initiator puts the information in the token message; every entity will be visited by the token and thus will receive the information. The converse is not necessarily true; for example, Flooding violates the sequentiality requirement since the message is sent to all (other) neighbors simultaneously. The use of traversal to broadcast does not lead to a more efficient broadcasting protocol. In fact, a comparison of the costs of Flooding and DF* (Expressions 1.1 and 2.12) shows that Flooding is more efficient in terms of both messages and ideal time. This is not surprising since a traversal is constrained to be sequential; flooding, by contrast, exploits concurrency to the utmost.

2.4 PRACTICAL IMPLICATIONS: USE A SUBNET

We have considered three basic problems (broadcast, wake-up, and depth-first traversal) and studied their complexity, devised solution protocols, and analyzed their efficiency. Let us see what the theoretical results we have obtained tell us about the situation from a practical point of view. We have seen that generic protocols for broadcasting and wake-up require Ω(m) messages (Theorem 2.1.1). Indeed, in some special networks, we can sometimes develop topology-dependent solutions and obtain some improvements. A similar situation exists for generic traversal protocols: They all require Ω(m) messages (Theorem 2.3.1); this cost cannot be reduced (in order of magnitude) unless we make additional restrictions, for example, exploiting some special properties of G of which we have a priori (i.e., at design time) knowledge.

In any connected, undirected graph G, we have (n^2 − n)/2 ≥ m ≥ n − 1, and, for every value in that range, there are networks with that many links; in particular, m = (n^2 − n)/2 occurs when G is the complete graph, and m = n − 1 when G is a tree. Summarizing, the cost of broadcasting, wake-up, and traversal depends on the number of links: the more links, the greater the cost; and it can be as bad as O(n^2) messages per execution of any of the solution protocols.

This result is punitive for networks where a large investment has been made in the construction of communication links. As broadcast is a basic communication tool (in some systems, it is a primitive one), dense networks are penalized continuously. Similarly, larger operating costs will be incurred by dense networks every time a wake-up (a very common operation, used as a preliminary step in most computations) or a traversal (fortunately, not such a common operation) is performed.


The theoretical results, in other words, indicate that investments in communication hardware will result in higher operating communication costs. Obviously, this is not an acceptable situation, and it is necessary to employ some "lateral thinking." The strategy to circumvent the obstacle posed by these lower bounds (Theorems 2.1.1 and 2.3.1) without restricting the applicability of the protocol is fortunately simple:
1. construct a subnet G′ of G and
2. perform the operations only on the subnet.

If the subnet G′ we construct is connected and spans G (i.e., contains all nodes of G), then doing broadcast on G′ will solve the broadcasting problem on G: Every node (entity) will receive the information. Similarly, performing a traversal on G′ will solve that problem on G. The important consequence is that, if G′ is a proper subnet, it has fewer links than G; thus, the cost of performing those operations on G′ will be lower than doing it in G.

Which connected spanning subnet of G should we construct? If we want to minimize the message costs, we should choose the one with the fewest links; thus, the answer is: a spanning tree of G. So, the strategy for a general graph G will be

Strategy Use-a-Tree:
1. construct a spanning tree of G and
2. perform the operations only on this spanning tree.

This strategy has two costs. First, there is the cost of constructing the spanning tree; this task will have to be carried out only once (if no failures occur). Then there are the operating costs, that is, the costs of performing broadcast, wake-up, and traversal on the tree. Broadcast will cost exactly n − 1 messages, and the cost of wake-up and traversal will be twice that amount. These costs are independent of m and thus do not inhibit investments in communication links (which might be useful for other reasons).

2.5 CONSTRUCTING A SPANNING TREE

Spanning-tree construction (SPT) is a classical problem in computer science. In a distributed computing environment, the solution of this problem has, as we have seen, strong practical motivations. It also has a distinct formulation and distinct requirements. In a distributed computing environment, to construct a spanning tree of G means to move the system from an initial system configuration, where each entity is just aware of its own neighbors, to a system configuration where
1. each entity x has selected a subset Tree-neighbors(x) ⊆ N(x) and
2. the collection of all the corresponding links forms a spanning tree of G.
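The two conditions above can be checked mechanically from an external, global viewpoint. The following Python sketch is only such an outside check, not something an entity could run locally; it assumes (our choice of representation) that both G and the selections are given as dictionaries mapping each entity to a set of neighbors, with at least one entity.

```python
from collections import deque


def is_spanning_tree(graph, tree_neighbors):
    """Check that the Tree-neighbors selections define a spanning tree of graph."""
    nodes = set(graph)
    if set(tree_neighbors) != nodes:
        return False
    for x in nodes:
        # Condition 1: Tree-neighbors(x) is a subset of N(x).
        if not tree_neighbors[x] <= graph[x]:
            return False
        # Both endpoints of a selected link must have selected it.
        if any(x not in tree_neighbors[y] for y in tree_neighbors[x]):
            return False
    # Condition 2: exactly n - 1 selected links, all nodes connected by them.
    links = sum(len(s) for s in tree_neighbors.values()) // 2
    if links != len(nodes) - 1:
        return False
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for y in tree_neighbors[x] - seen:
            seen.add(y)
            queue.append(y)
    return seen == nodes
```

A connected selection with n − 1 links cannot contain a cycle, which is why no separate acyclicity test is needed.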


What is wanted is a distributed algorithm (specifying what each node has to do when receiving a message in a given status) such that, once executed, it guarantees that a spanning tree T(G) of G has been constructed; in the following we will indicate T(G) simply by T, if no ambiguity arises. Note that T is not known a priori to the entities and might not be known after it has been constructed: an entity needs to know only which of its neighbors are also its neighbors in the spanning tree T. As before, we will restrict ourselves to connected networks with bidirectional links and further assume that no failure will occur. We will ﬁrst assume that the construction will be started by only one entity (i.e., Unique Initiator (UI) restriction); that is, we will consider spanning-tree construction under restrictions RI. We will then consider the general problem when any number of entities can independently start the construction. As we will see, the situation changes dramatically from the single-initiator scenario.

2.5.1 SPT Construction with a Single Initiator: Shout

Consider the entities; they do not know G, not even its size. The only things an entity is aware of are the labels on the ports leading to its neighbors (because of the Local Orientation axiom) and the fact that, if it sends a message to a neighbor, the message will eventually be received (because of the Finite Communication Delays axiom and the Total Reliability restriction). How, using just this information, can a spanning tree be constructed? The answer is surprisingly simple. Each entity needs to know which of its neighbors are also neighbors in the spanning tree. The solution strategy is just "ask":

Strategy Ask-Your-Neighbors:
1. The initiator s will "ask" its neighbors; that is, it will send a message Q = ("Are you my neighbor in the spanning tree?") to all its neighbors.
2. An entity x ≠ s will reply "Yes" only the first time it is asked and, on this occasion, it will ask all its other neighbors; otherwise, it will reply "No." The initiator s will always reply "No."
3. Each entity terminates when it has received a reply from all neighbors to which it asked the question.

For an entity x, its neighbors in the spanning tree T are the neighbors that have replied "Yes" and, if x ≠ s, also the neighbor from which the question was first asked. The corresponding set of rules is depicted in Figure 2.11, where the tree links are shown in bold and the nontree links in dotted lines. The protocol Shout implementing this strategy is shown in Figure 2.12. Initially, all nodes are in status idle except the sole initiator.


[FIGURE 2.11: Set of Rules of Shout. Tree links are drawn in bold and links not in the tree as dotted lines; the message labels are Q, YES, and NO.]

Before we discuss the correctness and the efficiency of the protocol, consider how it is structured and operates. First of all observe that, in Shout, the question Q is broadcast through the network (using flooding). Further observe that, when an entity receives Q, it always sends a reply (either Yes or No). Summarizing, the structure of this protocol is a flood where every information message is acknowledged. This type of structure will be called Flood + Reply.


PROTOCOL Shout

Status: S = {INITIATOR, IDLE, ACTIVE, DONE}; S_INIT = {INITIATOR, IDLE}; S_TERM = {DONE}.
Restrictions: R; UI.

INITIATOR
   Spontaneously
   begin
      root := true;
      Tree-neighbors := ∅;
      send(Q) to N(x);
      counter := 0;
      become ACTIVE;
   end

IDLE
   Receiving(Q)
   begin
      root := false;
      parent := sender;
      Tree-neighbors := {sender};
      send(Yes) to {sender};
      counter := 1;
      if counter = |N(x)| then
         become DONE
      else
         send(Q) to N(x) − {sender};
         become ACTIVE;
      endif
   end

ACTIVE
   Receiving(Q)
   begin
      send(No) to {sender};
   end

   Receiving(Yes)
   begin
      Tree-neighbors := Tree-neighbors ∪ {sender};
      counter := counter + 1;
      if counter = |N(x)| then become DONE; endif
   end

   Receiving(No)
   begin
      counter := counter + 1;
      if counter = |N(x)| then become DONE; endif
   end

FIGURE 2.12: Protocol Shout
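To experiment with the rules of Figure 2.12, one can run them in a small centralized simulation. The Python sketch below is only an illustration under our own assumptions: the network is a dictionary mapping each entity to the set of its neighbors, delays are modeled by delivering the oldest pending message of a randomly chosen link, and each link is assumed to deliver messages in the order they were sent (an assumption of this sketch, not something stated in the text). The function name shout and its return values are hypothetical.

```python
import random


def shout(graph, s, seed=0):
    """Simulate protocol Shout from initiator s; return (tree, messages, status)."""
    rng = random.Random(seed)
    status = {x: "IDLE" for x in graph}
    tree = {x: set() for x in graph}        # Tree-neighbors of each entity
    counter = {x: 0 for x in graph}
    in_transit = []                          # (sender, receiver, kind), in send order
    sent = 0

    def send(kind, x, targets):
        nonlocal sent
        for y in targets:
            in_transit.append((x, y, kind))
            sent += 1

    # INITIATOR, spontaneously: ask every neighbor.
    send("Q", s, graph[s])
    status[s] = "ACTIVE"

    while in_transit:
        # Pick a random link with pending messages and deliver its oldest one.
        link = rng.choice(in_transit)[:2]
        i = next(k for k, msg in enumerate(in_transit) if msg[:2] == link)
        sender, x, kind = in_transit.pop(i)

        if status[x] == "IDLE" and kind == "Q":
            tree[x] = {sender}               # the first asker becomes the parent
            send("Yes", x, [sender])
            counter[x] = 1
            if counter[x] == len(graph[x]):
                status[x] = "DONE"
            else:
                send("Q", x, graph[x] - {sender})
                status[x] = "ACTIVE"
        elif status[x] == "ACTIVE":
            if kind == "Q":
                send("No", x, [sender])
            else:                            # a reply, either Yes or No
                if kind == "Yes":
                    tree[x].add(sender)
                counter[x] += 1
                if counter[x] == len(graph[x]):
                    status[x] = "DONE"
    return tree, sent, status
```

Running the sketch with different seeds on the same graph typically yields different trees, while the number of messages stays the same.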

Correctness Let us now show that Flood + Reply, as used above, always constructs a spanning tree; that is, the graph deﬁned by all the Tree-neighbors computed by the entities forms a spanning tree of G; furthermore, this tree is rooted in the initiator s.


Theorem 2.5.1 Protocol Shout correctly terminates.

Proof. This protocol consists of the flooding of Q, where every Q message is acknowledged. Because of the correctness of flooding, we are guaranteed that every entity will receive Q and by construction will reply (either Yes or No) to each Q it receives. Termination then follows. To prove correctness we must show that the subnet G′ defined by all the Tree-neighbors is a spanning tree of G. First observe that, if x is in Tree-neighbors of y, then y is in Tree-neighbors of x (see Exercise 2.9.18). If an entity x sends a Yes to y, then it is in Tree-neighbors of y; furthermore, it is connected to s by a path where a Yes is sent on each link (see Exercise 2.9.19). Since every x ≠ s sends exactly one Yes, the subnet G′ defined by all the Tree-neighbors contains all the entities (i.e., it spans G), it is connected, and contains no cycles (see Exercise 2.9.20). Therefore, it is a spanning tree of G. ∎

Note that G′ is actually a tree rooted in the initiator. Recall that, in a rooted tree, every node (except the root) has one parent: the neighbor closest to the root; all its other neighbors are called children. The neighbor to which x sends a Yes is its parent; all neighbors from which it receives a Yes are its children. This fact can be useful in subsequent operations.

IMPORTANT. The execution of protocol Shout ends with local termination: each entity knows when its own execution is over; this occurs when it enters status done. Notice however that no entity, including the initiator, is aware of global termination (i.e., that every entity has locally terminated). This situation is fairly common in distributed computations. Should we need the initiator to know that the execution has terminated (e.g., to start another task), Flood + Reply can be easily modified to achieve this goal (Exercise 2.9.24).

Costs

The message costs of Flood+Reply, and thus of Shout, are simple to analyze. As mentioned before, Flood+Reply consists of an execution of Flooding(Q) with the addition of a reply (either Yes or No) for every Q. In other words,

   M[Flood+Reply] = 2 M[Flooding].

The time costs of Flood+Reply, and thus of Shout, are also simple to determine; in fact (Exercise 2.9.21),

   T[Flood+Reply] = T[Flooding] + 1.

Thus

   M[Shout] = 4m − 2n + 2          (2.14)

   T[Shout] = r(s) + 1 ≤ d + 1     (2.15)


The efficiency of protocol Shout can be evaluated better by taking into account the complexity of the problem it is solving. Since every node must be involved, using an argument similar to the proof of Theorem 2.1.1, we have:

Theorem 2.5.2 M(SPT/RI) ≥ m.

Proof. Assume that there exists a correct SPT protocol A that, in each execution under RI on every G, uses fewer than m(G) messages. This means that there is at least one link in G where no message is transmitted in any direction during an execution of the algorithm. Consider an execution of the algorithm on G, and let e = (x, y) ∈ E be the link where no message is transmitted by A. Now construct a new graph G′ from G by removing the edge e and adding a new node z and two new edges e1 = (x, z) and e2 = (y, z) (see Fig. 2.2). Set z in a noninitiator status. Run exactly the same execution of A on the new graph G′: since no message was sent along (x, y), this is possible. But since no message was sent along (x, y) in the original execution in G, x and y never send a message to z in the current execution in G′; and since z is not the initiator and does not receive any message, it will not send any message. Within finite time, protocol A terminates claiming that a spanning tree T of G′ has been constructed; however, z is not part of T, and hence T does not span G′. ∎

And similarly to the broadcast problem we have

Theorem 2.5.3 T(SPT/RI) ≥ d.

This implies that protocol Shout is both time optimal and message optimal with respect to order of magnitude. In other words,

Property 2.5.1 The message complexity of spanning-tree construction under RI is Θ(m).

Property 2.5.2 The ideal time complexity of spanning-tree construction under RI is Θ(d).

In the case of the number of messages, some improvement might be possible in terms of the constant.

Hacking

Let us examine protocol Shout to see if it can be improved, thereby helping us to save some messages.

Question. Do we have to send No messages?
When constructing the spanning tree, an entity needs to know who its tree-neighbors are; by construction, they are the ones that reply Yes and, except for the initiator, also


the ones that first asked the question. Thus, for this determination, the No messages are not needed. On the other hand, the No messages are used by the protocol to terminate in finite time. Consider an entity x that just sent Q to neighbor y; it is now waiting for a reply. If the reply is Yes, it knows y is in the tree; if the reply is No, it knows y is not. If we remove the sending of No, how can x determine that y would have sent No? More clearly: Suppose x has been waiting for a reply from y for a (very) long time; it does not know if y has sent Yes and the delays are very long, or y would have sent No and thus will send nothing. Because the algorithm must terminate, x cannot wait forever and has to make a decision. How can x decide? The question is relevant because communication delays are finite but unpredictable.

Fortunately, there is a simple answer to the question that can be derived by examining how protocol Shout operates. Focus on a node x that just sent Q to its neighbor y. Why would y reply No? It would do so only if it had already said Yes to somebody else; if that happened, y sent Q at the same time to all its other neighbors, including x. Summarizing, if y replies No to x, it must have already sent Q to x. We can clearly use this fact to our advantage: after x sent Q to y, if it receives Yes it knows that y is its neighbor in the tree; if it receives Q, it can deduce that y will definitely reply No to x's question. All of this can be deduced by x without having received the No. In other words: a message Q that arrives at a node waiting for a reply can act as an implicit negative acknowledgment; therefore, we can avoid sending No messages.

Let us now analyze the message complexity of the resulting protocol Shout+. The time complexity is clearly unchanged; hence

   T[Shout+] = r(s) + 1 ≤ d + 1.     (2.16)

On each link (x, y) ∈ E there will be exactly a pair of messages: either Q in one direction and Yes in the other, or two Q messages, one in each direction. Thus

   M[Shout+] = 2m.                   (2.17)
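The implicit negative acknowledgment can be tried out with a small centralized simulation. The Python sketch below is an illustration under our own assumptions, not the book's specification of Shout+: the network is a dictionary mapping each entity to its set of neighbors, unpredictable delays are modeled by delivering a randomly chosen pending message, and the name shout_plus and its return values are hypothetical.

```python
import random


def shout_plus(graph, s, seed=0):
    """Simulate Shout without No messages; return (tree, messages, status)."""
    rng = random.Random(seed)
    status = {x: "IDLE" for x in graph}
    tree = {x: set() for x in graph}        # Tree-neighbors of each entity
    counter = {x: 0 for x in graph}

    # The initiator asks all its neighbors.
    in_transit = [(s, y, "Q") for y in graph[s]]
    sent = len(in_transit)
    status[s] = "ACTIVE"

    while in_transit:
        # Arbitrary finite delays: deliver any pending message.
        sender, x, kind = in_transit.pop(rng.randrange(len(in_transit)))
        if status[x] == "IDLE":
            # First Q received: the sender becomes the parent.
            tree[x].add(sender)
            out = [(x, sender, "Yes")]
            out += [(x, y, "Q") for y in graph[x] - {sender}]
            in_transit.extend(out)
            sent += len(out)
            status[x] = "ACTIVE"
        elif kind == "Yes":
            tree[x].add(sender)
        # A Q received while ACTIVE is an implicit No: nothing is sent back.
        counter[x] += 1
        if counter[x] == len(graph[x]):
            status[x] = "DONE"
    return tree, sent, status
```

In this sketch every entity receives exactly one message from each neighbor, so the counter reaches |N(x)| at every entity and the total matches Expression 2.17.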

2.5.2 Other SPT Constructions with Single Initiator

SPT Construction by Traversal

It is well known that a depth-first traversal of a graph G actually constructs a spanning tree (df-tree) of that graph. The df-tree is obtained by removing the back-edges from G (i.e., the edges where a Back-edge message was sent in DF Traversal). In other words, the tree-neighbors of an entity x will be those from which it receives a Return message and, if x is not the initiator, the one from which x received the first T. Simple modifications to protocol DF* will ensure that each entity will correctly compute its neighbors in the df-tree and locally terminate in finite time (Exercise 2.9.25). Notice that these modifications involve just local bookkeeping and no additional communication. Hence the time and message costs are unchanged. The resulting protocol is denoted by df-SPT; then

   M[df-SPT] = 4m − 2n + f + 1     (2.18)

   T[df-SPT] = 2n − 2.             (2.19)

We can now better characterize the variable f, which appears in the cost above. In fact, f is exactly the number of leaves of the df-tree constructed by df-SPT (Exercise 2.9.26). Expressions 2.18 and 2.19, when compared with the costs of protocol Shout, indicate that depth-first traversal is not an efficient tool for constructing a spanning tree; this is particularly true for its very high time costs. Notice that, as in protocol Shout, all entities will become aware of their local termination, but only the initiator will be aware of global termination, that is, that the construction of the spanning tree has been completed (Exercise 2.9.27).

SPT Construction by Broadcasting

We have just seen how, with simple modifications, the techniques of flooding and of df-traversal can be used to construct a spanning tree, if there is a unique initiator. This fact is part of a very interesting and more general phenomenon: under RI, the execution of any broadcast protocol constructs a spanning tree. Let us examine this statement in more detail. Take any broadcast protocol B; by definition of broadcast, its execution will result in all entities receiving the information initially held by the initiator. For each entity x different from the initiator, call parent the neighbor from which x received the information for the first time; clearly, everybody except the initiator will have only one parent, and the initiator has none. Denote by x → y the fact that x is the parent of y; then we have the following property whose proof is left as an exercise (Exercise 2.9.28):

Theorem 2.5.4 The parent relationship defines a spanning tree rooted in the initiator.

As a consequence, it would appear that, to solve SPT, we just need to execute a broadcast algorithm without any real modification, just adding some local variables (Tree-neighbors) and doing some local bookkeeping. This is generally not the case; in fact, knowing its parent in the tree is not enough for an entity. To solve SPT, when an entity x terminates its execution, it must explicitly know which neighbors are its children as well as which neighbors are not its tree-neighbors. If not provided already by the protocol, this information can obviously be acquired. For example, if every entity sends a notification message to its parent, the parents will know their children. To find out which neighbors are not children is more difficult and will depend on the original broadcast protocol. In protocol Shout this is achieved by adding the "Yes" (I am your child) and "No" (I am not your child) messages to Flooding. In the DF Traversal protocol this is already achieved by the "Return" (I am your child) and the "Back-edge" (I am not your child) messages; so, no additional communication is required.

This fact establishes a computational relationship between the broadcasting problem and the spanning-tree construction problem. If I know how to broadcast, then (with minor modifications) I know how to construct a spanning tree with a unique initiator. The converse is also trivially true: Every protocol that constructs a spanning tree solves the broadcasting problem. We shall say that these two problems are computationally equivalent and denote this fact by

   Bcast ≡ SPT(UI).     (2.20)
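Seen from outside the system, the step from the parent relationship to the sets of tree-neighbors is pure bookkeeping, as the following Python sketch shows. It takes a global view that no single entity has: the dictionary parent, recording for each non-initiator the neighbor from which it first received the information, is a hypothetical input that would be produced by some broadcast execution.

```python
def tree_from_parents(parent):
    """Build the Tree-neighbors sets from a map child -> parent."""
    tree = {}
    for child, p in parent.items():
        tree.setdefault(child, set()).add(p)   # the child knows its parent locally
        tree.setdefault(p, set()).add(child)   # the parent must be told about its child
    return tree


if __name__ == "__main__":
    # Entity "s" is the initiator; it has no parent.
    print(tree_from_parents({"a": "s", "b": "s", "c": "a"}))
```

The second assignment in the loop is exactly the information that, in a distributed execution, has to reach the parent through a message such as Yes or Return.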

Since, as we have discussed in Section 2.3.4, every traversal protocol performs a broadcast, it follows that, under RI, the execution of any traversal protocol constructs a spanning tree.

SPT Construction by Global Protocols

Actually, we can make a much stronger statement. Call a problem global if every entity must participate in its solution; participation implies the execution of a communication activity: transmission of a message and/or arrival of a message (even if it triggers only the Null action, i.e., no action is taken). Both broadcast and traversal are global problems. Now, every single-initiator protocol that solves a global problem P solves also Bcast; thus, from Equation 2.20, it follows that, under RI, the execution of any solution to a global problem P constructs a spanning tree.

2.5.3 Considerations on the Constructed Tree

We have seen how, with a few more messages than those required by flooding and the same messages as a df-traversal, we can actually construct a spanning tree. As discussed previously, once such a tree is constructed, we can from now on perform broadcast and traversal using only O(n) messages (which is optimal) instead of O(m) (which could be as bad as O(n²)).

IMPORTANT. Different techniques construct different spanning trees. It is even possible that the same protocol constructs different spanning trees when executed at different times. This is for example the case of Shout: Because communication delays are unpredictable, subsequent executions of this algorithm on the same graph may result in different spanning trees. In fact (Exercise 2.9.23) every possible spanning tree of G could be constructed by Shout.


Prior to its execution, it is impossible to predict which spanning tree will be constructed; the only guarantee is that Shout will construct one. This has implications for the time costs of the strategy Use-a-Tree of broadcasting on the spanning tree T instead of the entire graph G. In fact, the broadcast time will be d(T) instead of d(G); but d(T) could be much greater than d(G). For example, if G is the complete graph, the df-tree constructed by any depth-first traversal will have d(T) = n − 1; but d(G) = 1. In general, the trees constructed by depth-first traversal usually have terrible diameters. The ones generated by Shout usually perform better, but there is no guarantee on the diameter of the resulting tree.

This fact poses the problem of constructing spanning trees that have a good diameter; that is, to find a spanning tree T of G such that d(T) is not much more than d(G). For obvious reasons, such a tree is traditionally called a broadcast tree.

To construct a broadcast tree we must first understand the relationship between radius and diameter. The eccentricity (or radius) of a node x in G is the longest of its distances to the other nodes:

   rG(x) = Max{dG(x, y) : y ∈ V}.

A node c with minimum radius (or eccentricity) is called a center; that is, ∀x ∈ V, rG(c) ≤ rG(x). There might be more than one center; they all, however, have the same eccentricity, which is denoted by r(G) and is called the radius of G:

   r(G) = Min{rG(x) : x ∈ V}.

There is a strong relationship between the radius and the diameter of a graph; in fact, in every graph G,

   r(G) ≤ d(G) ≤ 2r(G).     (2.21)
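These definitions translate directly into a centralized computation based on breadth-first search. The Python sketch below is only meant to make the quantities concrete; it assumes a connected graph given as a dictionary mapping each node to the set of its neighbors, and the function names are our own.

```python
from collections import deque


def distances(graph, x):
    """Hop distances from x to every node of a connected graph."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricity(graph, x):
    """r_G(x): the largest distance from x to any other node."""
    return max(distances(graph, x).values())


def radius_and_diameter(graph):
    """Return (r(G), d(G)) as the smallest and largest eccentricity."""
    ecc = [eccentricity(graph, x) for x in graph]
    return min(ecc), max(ecc)
```

For a chain of four nodes, for instance, the two middle nodes are the centers, the radius is 2, and the diameter is 3, in agreement with Expression 2.21.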

The other ingredient we need is a breadth-first spanning tree (bf-tree). A breadth-first spanning tree of G rooted in a node u, denoted by BFT(u, G), has the following property: The distance between a node v and the root in the tree is the same as their distance in the original graph G. The strategy to construct a broadcast tree with diameter d(T) ≤ 2d(G) is then simple to state:

Strategy Broadcast-Tree Construction:
1. determine a center c of G;
2. construct a breadth-first spanning tree BFT(c, G) rooted in c.

This strategy will construct the desired broadcast tree (Exercise 2.9.29):

Theorem 2.5.5 BFT(c, G) is a broadcast tree of G.
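The two steps of the strategy can be sketched in a centralized setting, where the whole graph is available to a single program. The Python code below is such a sketch and says nothing about how to perform either step distributively; the graph representation (a dictionary of neighbor sets) and the function names are assumptions of ours.

```python
from collections import deque


def bfs_tree(graph, root):
    """Return (tree_neighbors, depth) of a breadth-first spanning tree rooted in root."""
    depth = {root: 0}
    tree = {x: set() for x in graph}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in depth:
                depth[v] = depth[u] + 1   # same as the distance from root in G
                tree[u].add(v)
                tree[v].add(u)
                queue.append(v)
    return tree, depth


def broadcast_tree(graph):
    """Step 1: find a center c; step 2: build BFT(c, G). Returns (c, tree, height)."""
    center = min(graph, key=lambda x: max(bfs_tree(graph, x)[1].values()))
    tree, depth = bfs_tree(graph, center)
    return center, tree, max(depth.values())
```

The height returned is r(G); any two nodes of the tree are joined through the root by a path of at most 2r(G) links, and r(G) ≤ d(G) by Expression 2.21, which gives d(T) ≤ 2d(G).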


To be implemented, this strategy requires that we solve two problems: Center Finding and Breadth-First Spanning-Tree Construction. These problems, as we will see, are not simple to solve efficiently; we will examine them in later chapters.

2.5.4 Application: Better Traversal

In Section 2.4, we have discussed the general strategy Use-a-Tree for problem solving. Now that we know how to construct a spanning tree (using a single initiator), let us apply the strategy to a known problem. Consider again the traversal problem. Using the Use-a-Tree strategy, we can produce an efficient traversal protocol that is much simpler than all the algorithms we have considered before:

Protocol SmartTraversal:
1. Construct, using Shout+, a spanning tree T rooted in the initiator.
2. Perform a traversal of T, using DF Traversal.

The number of messages of SmartTraversal is easy to compute: Shout+ uses 2m messages (Equation 2.17), while DF Traversal on a tree uses exactly 2(n − 1) messages (Equation 2.13). In other words,

   M[SmartTraversal] = 2(m + n − 1).     (2.22)

The problem with DF Traversal was its time complexity: It was to reduce time that we developed more complex protocols. How about the time costs of this simple new protocol? The ideal time of Shout+ is at most d + 1 (Equation 2.16). The ideal time of DF Traversal in a tree is 2(n − 1). Hence the total is

   T[SmartTraversal] ≤ 2n + d − 1.       (2.23)
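As a quick numerical check of Expressions 2.22 and 2.23, the following Python sketch adds up the two phases for given values of n, m, and d. The function name is hypothetical; the formulas are the ones stated above.

```python
def smart_traversal_costs(n: int, m: int, d: int) -> tuple[int, int]:
    """Return (messages, upper bound on ideal time) for SmartTraversal."""
    shout_plus_messages = 2 * m          # Expression 2.17
    tree_traversal_messages = 2 * (n - 1)
    shout_plus_time = d + 1              # upper bound from Expression 2.16
    tree_traversal_time = 2 * (n - 1)
    return (shout_plus_messages + tree_traversal_messages,
            shout_plus_time + tree_traversal_time)


if __name__ == "__main__":
    # Complete graph on 100 nodes: m = 4950 links, diameter 1.
    print(smart_traversal_costs(100, 4950, 1))
```

For the complete graph on 100 nodes this gives 10098 messages and a time bound of 200.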

In other words, SmartTraversal not only is simple but also has optimal time and message complexity.

2.5.5 Spanning-Tree Construction with Multiple Initiators

We have started examining the spanning-tree construction problem in Section 2.5 assuming that there is a unique initiator. This is unfortunately a very strong (and "unnatural") assumption to make, as well as difficult and expensive to guarantee. What happens to the single-initiator protocols Shout and df-SPT if there is more than one initiator?

Let us examine first protocol Shout. Consider the very simple case (depicted in Fig. 2.13) of three entities, x, y, and z, connected to each other. Let both x and y be initiators and start the protocol, and let the Q message from x to z arrive there before the one sent by y.

[FIGURE 2.13: With multiple initiators, Shout creates a forest. The panels show the entities X, Y, and Z exchanging Q and YES messages.]

In this case, neither the link (x,y) nor the link (y,z) will be included in the tree; hence, the algorithm creates not a spanning tree but a spanning forest, which is not connected.

Consider now protocol df-SPT, discussed in Section 2.5.2. Let us examine its execution in the simple network depicted in Figure 2.14, composed of a chain of four nodes x, y, z, and w. Let y and z be both initiators, and start the traversal by sending the T message to x and w, respectively. Also in this case, the algorithm will create a disconnected spanning forest of the graph. It is easy to verify that the same situation will occur also with the optimized versions (DF+ and DF*) of the protocol (Exercise 2.9.30).

The failure of these algorithms is not surprising, as they were developed specifically for the restricted environment of a Unique Initiator. Removing the restriction brings out the true nature of the problem, which, as we will now see, has a formidable obstacle.

2.5.6 Impossibility Result

Our goal is to design a spanning-tree protocol that works solely under the standard assumptions and thus is independent of the number of initiators. Unfortunately, any design effort to this end is destined to fail. In fact

Theorem 2.5.6 The SPT problem is deterministically unsolvable under R.

Deterministically unsolvable means that there is no deterministic protocol that always correctly terminates within finite time.

[FIGURE 2.14: With multiple initiators, df-SPT creates a forest. The panels show the chain X, Y, Z, W exchanging T, Return, and Back messages.]

[FIGURE 2.15: Proof of Theorem 2.5.6. The entities X, Y, and Z are pairwise connected; at each entity the two ports are labeled 1 and 2.]

Proof. To see why this is the case, consider the simple system composed of three entities x, y, and z connected by links labeled as shown in Figure 2.15. Let the three entities have identical initial values (the symbols x, y, z are used only for description purposes). If a solution protocol A exists, it must work under any conditions of message delays (as long as they are finite) and regardless of the number of initiators. Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let all three entities start the execution of A simultaneously. Since they are in identical states (same initial status and values, same port labels), they will execute the same rule, obtain the same results (thus, continuing to have the same local values), compose and send (if any) the same messages, and enter the same (possibly new) status. In other words, by Property 1.6.2, they will remain in identical states. In the next time unit, all sent messages (if any) will arrive and be processed. If one entity receives a message, the others will receive the same message at the same time, perform the same local computation, compose and send (if any) the same messages, and enter the same (possibly new) status. And so on. In other words, the entities will continue to be in identical states.

If A is a solution protocol, it must terminate within finite time. A spanning tree of our simple system is obtained by removing one of the three links, let us say (x,y). In this case, Tree-neighbors will be the port label 2 for entity x and the port label 1 for entity y; instead, z has in Tree-neighbors both port numbers. In other words, when they all terminate, they have distinct values for their local variable Tree-neighbors. But this is impossible, since we just said that the states of the entities are always identical. Thus, no such solution algorithm A exists. ∎

A consequence of this very negative result is that, to construct a spanning tree without constraints on the number of initiators, we need to impose additional restrictions. To determine the "minimal" restrictions that, added to R, will enable us to solve SPT is an interesting research problem still open. The restriction that is commonly used is a very powerful one, Initial Distinct Values, and we will discuss it next.

2.5.7 SPT with Initial Distinct Values

The impossibility result we just witnessed implies that, to solve the SPT problem, we need an additional restriction. The one commonly used is Initial Distinct Values (ID): Each entity has a distinct initial value. Distinct initial values are sometimes called identifiers or ids or global names.
We will now examine some ways in which SPT can be solved under IR = R ∪ {ID}.

Multiple Spanning Trees

As in most software design situations, once we have a solution for a problem and are faced with a more general one, one approach is to try to find ways to reuse and reapply the already existing solution. The solutions we already have are unique-initiator ones and, as we know, they fail in the presence of multiple initiators. Let us see how we can mend their shortcomings using distinct values.

Consider the execution of Shout in the example of Figure 2.13. In this case, the protocol fails because the entities do not realize that there are two different requests (e.g., when x receives Q from y) for spanning-tree construction. But we can now use the entities' ids to distinguish between requests originating from different initiators.

The simplest and most immediate application of this approach is to have each initiator construct "its own" spanning tree with a single-initiator protocol and to use


the ids of the initiators to distinguish among different constructions. So, instead of cooperating to construct a single spanning tree, we will have several spanning trees concurrently and independently built. This implies that all the protocol messages (e.g., Q and Yes in Shout+) must contain also the id of the initiator. It also requires additional variables and bookkeeping; for example, at each entity, there will be several instances of the variable Tree-neighbors, one for each spanning tree being constructed (i.e., one for each initiator). Furthermore, each entity will be in possibly different status values for each of these independent SPT-constructions. Recall that the number k of initiators is not known a priori and can change at every execution.

The message cost of this approach depends solely on the number of initiators and on the type of unique-initiator protocol used. But it is in any case very expensive. In fact, if we employ the most efficient SPT-construction protocol we know, Shout+, we will use 2mk messages, which could be as bad as O(n³).

Selective Construction

The large message cost derives from the fact that we construct not one but k spanning trees. Since our goal is just to construct one, there is clearly a needless amount of communication and computation being performed. A better approach consists of letting every initiator start the construction of its own uniquely identified spanning tree (as before), but then suppressing some of these constructions, allowing only one to complete. In this approach, an entity faced with two different SPT-constructions will select and act on only one, "killing" the other; the entity continues this selection process as long as it receives conflicting requests.

The criterion an entity uses to decide which SPT-construction to follow and which one to terminate must be chosen very carefully. In fact, the danger is to "kill" all constructions. The criterion commonly used is based on min-id: Since each SPT-construction has a unique id (that of its initiator), when faced with different SPT-constructions, an entity will choose the one with the smallest id and terminate all the others. (An alternative criterion would be the one based on max-id.)

The solution obtained with this approach has some very clear advantages over the previous solution. First of all, each entity is at any time involved only in one SPT-construction; this fact greatly simplifies the internal organization of the protocol (i.e., the set of rules), as well as the local storage and bookkeeping of each entity. Second, upon termination, all entities have a single shared spanning tree for subsequent uses.

However, there is still competitive concurrency: An entity involved in one SPT-construction might receive messages from another construction; in our approach, it will make a choice between the two constructions. If the entity chooses the new one, it will give up all the knowledge (variables, etc.) acquired so far and start from scratch. The message cost of this approach depends again on the number of initiators and on the unique-initiator protocol used.

Consider a protocol developed using this approach, using Shout+ as the basic tool. Informally, an entity u, at any time, participates in the construction of just one spanning tree rooted in some initiator, x. It will ignore all messages referring to the construction of other spanning trees where the initiators have larger ids than x. If


instead u receives a message referring to the construction of a spanning tree rooted in an initiator y with an id smaller than x’s, then u will stop working for x and start working for y. As we will see, these techniques will construct a spanning tree rooted in the initiator with the smallest initial value.

IMPORTANT. It is possible that an entity has already terminated its part of the construction of a spanning tree when it receives a message from another initiator (possibly, with a smaller id). In other words, when an entity has terminated a construction, it does not know whether it might have to restart again. Thus, it is necessary to include in the protocol a mechanism that ensures an effective local termination for each entity.

This can be achieved by ensuring that we use, as a building block, a unique-initiator SPT-protocol in which the initiator will know when the spanning tree has been completely constructed (see Exercise 2.9.24). In this way, when the spanning tree rooted in the initiator s with the smallest initial value has been constructed, s will become aware of this fact (as well as that all other constructions, if any, have been “killed”). It can then notify all other entities so that they can enter a terminal status. The notification is just a broadcast; it is appropriate to perform it on the newly constructed spanning tree (so we start taking advantage of its existence).

Protocol MultiShout, depicted in Figures 2.16 and 2.17, uses Shout+ appropriately modified so as to ensure that the root of a constructed tree becomes aware of termination, and it includes a final broadcast (on the spanning tree) to notify all entities that the task has indeed been completed. We denote by v(x) the id of x; initially all entities are idle and any of them can spontaneously start the algorithm.

Theorem 2.5.7 Protocol MultiShout constructs a spanning tree rooted in the initiator with the smallest initial value.

Proof. Let s be the initiator with the smallest initial value.
Focus on an initiator x ≠ s; its initial execution of the protocol will start the construction of a spanning tree T_x rooted in x. We will first show that the construction of T_x will not be completed. To see this, observe that T_x must include every node, including s; but when s receives a message relating to the construction of somebody else’s tree (such as T_x), it will ignore it, killing the construction of that tree. Let us now show that T_s will instead be constructed. Since the id of s is smaller than all other ids, no entity will ignore the messages related to the construction of T_s started by s; thus, the construction will be completed. ∎

Let us now consider the message costs of protocol MultiShout. It is clearly more efficient than protocols obtained with the previous approach. However, in the worst case, it is not much better in order of magnitude. In fact, it can be as bad as O(n^3). Consider for example the graph, shown in Figure 2.18, where n − k of the nodes are fully connected among themselves (the subgraph K_{n−k}), and each of the other


PROTOCOL MultiShout

Status: S = {IDLE, ACTIVE, DONE}; S_INIT = {IDLE}; S_TERM = {DONE}.
Restrictions: R; ID.

IDLE
    Spontaneously
    begin
        root := true;
        root-id := v(x);
        Tree-neighbors := ∅;
        send(Q, root-id) to N(x);
        counter := 0;
        check-counter := 0;
        become ACTIVE;
    end

    Receiving(Q, id)
    begin
        CONSTRUCT;
    end

ACTIVE
    Receiving(Q, id)
    begin
        if root-id = id then
            counter := counter + 1;
            if counter = |N(x)| then done := true; CHECK; endif
        else if root-id > id then
            CONSTRUCT;
        endif
    end

    Receiving(Yes, id)
    begin
        if root-id = id then
            Tree-neighbors := Tree-neighbors ∪ {sender};
            counter := counter + 1;
            if counter = |N(x)| then done := true; CHECK; endif
        endif
    end

    Receiving(Check, id)
    begin
        if root-id = id then
            check-counter := check-counter + 1;
            if (done ∧ check-counter = |Children|) then TERM; endif
        endif
    end

    Receiving(Terminate)
    begin
        send(Terminate) to Children;
        become DONE;
    end

FIGURE 2.16: Protocol MultiShout


Procedure CONSTRUCT
begin
    root := false;
    root-id := id;
    Tree-neighbors := {sender};
    parent := sender;
    send(Yes, root-id) to {sender};
    counter := 1;
    check-counter := 0;
    if counter = |N(x)| then
        done := true;
        CHECK;
    else
        send(Q, root-id) to N(x) − {sender};
    endif
    become ACTIVE;
end

Procedure CHECK
begin
    Children := Tree-neighbors − {parent};
    if Children = ∅ then
        send(Check, root-id) to parent;
    endif
end

Procedure TERM
begin
    if root then
        send(Terminate) to Tree-neighbors;
        become DONE;
    else
        send(Check, root-id) to parent;
    endif
end

FIGURE 2.17: Routines of MultiShout

[Figure: the external nodes x_1, x_2, ..., x_k, each connected to a node of the complete subgraph K_{n−k}.]

FIGURE 2.18: The execution of MultiShout can cost O(k(n − k)^2) messages.


k (nodes x_1, x_2, ..., x_k) is connected only to a node in K_{n−k}. Suppose that these k “external” nodes are the initiators and that v(x_1) > v(x_2) > ··· > v(x_k). Consider now an execution where the Q messages from the external entities arrive at K_{n−k} in order, according to the indices (i.e., the one from x_1 arrives first). When the Q message from x_1 arrives at K_{n−k}, it will trigger the SPT-construction there. Notice that the Shout+ component of our protocol with a unique initiator will use O((n − k)^2) messages inside the subgraph K_{n−k}. Assume that the entire computation inside K_{n−k} triggered by x_1 is practically completed (costing O((n − k)^2) messages) by the time the Q message from x_2 arrives at K_{n−k}. Since v(x_1) > v(x_2), all the work done in K_{n−k} has been wasted and every entity there must start the construction of the spanning tree rooted in x_2. In the same way, assume that the time delays are such that the Q message from x_i arrives at K_{n−k} only when the computation inside K_{n−k} triggered by x_{i−1} is practically completed (costing O((n − k)^2) messages). Then, in this case (which is possible), work costing O((n − k)^2) messages will be repeated k times, for a total of O(k(n − k)^2) messages. If k is a linear fraction of n (e.g., k = n/2), then the cost will be O(n^3).

The fact that this solution is not very efficient does not imply that the approach of selective construction it uses is not effective. On the contrary, it can be made efficient at the expense of simplicity. We will examine it in great detail later in the book when studying the leader election problem.
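As a rough numerical illustration of this bound, the following sketch tallies only the work repeated inside K_{n−k}. It assumes, as stated earlier for Shout+, 2m messages per construction on a subgraph with m links, and it ignores the few messages on the k external links; the function name repeated_work is hypothetical.

```python
def repeated_work(n, k):
    """Messages spent inside K_{n-k} if its construction is redone k times."""
    size = n - k
    links = size * (size - 1) // 2      # number of links of the complete graph K_{n-k}
    return k * 2 * links                # k constructions, 2m messages each (assumed)

# With k = n/2, doubling n multiplies the count by roughly eight (cubic growth).
for n in (16, 32, 64):
    print(n, repeated_work(n, n // 2))
```

The count is an estimate of the adversarial schedule described above, not a simulation of MultiShout.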

2.6 COMPUTATIONS IN TREES

In this section, we consider computations in tree networks under the standard restrictions R plus, clearly, the common knowledge that the network is a tree. Note that the knowledge of being in a tree implies that each entity can determine whether it is a leaf (i.e., it has only one neighbor) or an internal node (i.e., it has more than one neighbor).

We have already seen how to solve the Broadcast, the Wake-Up, and the Traversal problems in a tree network. The first two are optimally solved by protocol Flooding, the latter by protocol DF Traversal. These techniques constitute the first set of algorithmic tools for computing in trees with multiple initiators. We will now introduce another very basic and useful technique, saturation, and show how it can be employed to efficiently solve many different problems in trees regardless of the number of initiators and of their location.

Before doing so, we need to introduce some basic concepts and terminology about trees. In a tree T, the removal of a link (x, y) will disconnect T into two trees, one containing x (but not y), the other containing y (but not x); we shall denote them by T[x − y] and T[y − x], respectively. Let d[x, y] = Max{d(x, z) : z ∈ T[y − x]} be the longest distance between x and the nodes in T[y − x]. Recall that the longest distance between any two nodes is called the diameter, and it is denoted by d. If d(x, y) = d, the path between x and y is said to be diametral.


2.6.1 Saturation: A Basic Technique

The technique, which we shall call Full Saturation, is very simple and can be autonomously and independently started by any number of initiators. It is composed of three stages:

1. the activation stage, started by the initiators, in which all nodes are activated;
2. the saturation stage, started by the leaf nodes, in which a unique couple of neighboring nodes is selected; and
3. the resolution stage, started by the selected pair.

The activation stage is just a wake-up: each initiator sends an activation (i.e., wake-up) message to all its neighbors and becomes active; any noninitiator, upon receiving the activation message from a neighbor, sends it to all its other neighbors and becomes active; active nodes ignore all received activation messages. Within finite time, all nodes become active, including the leaves. The leaves will start the second stage.

Each active leaf starts the saturation stage by sending a message (call it M) to its only neighbor, referred to now as its “parent,” and becomes processing. (Note: M messages will start arriving within finite time at the internal nodes.) An internal node waits until it has received an M message from all its neighbors but one, sends an M message to that neighbor, which will now be considered its “parent,” and becomes processing. If a processing node receives a message from its parent, it becomes saturated.

The resolution stage is started by the saturated nodes; the nature of this stage depends on the application. Commonly, this stage is used as a notification for all entities (e.g., to achieve local termination). Since the nature of the final stage will depend on the application, we will only describe the set of rules implementing the first two stages of Full Saturation.

IMPORTANT. A “truncated” protocol like this will be called a “plug-in”. In its execution, not all entities will enter a terminal status.
To transform it into a full protocol, some other action (e.g., the resolution stage) must be performed so that eventually all entities enter a terminal status. It is assumed that initially all entities are in the same status available.

Let us now discuss some properties of this basic technique.

Lemma 2.6.1 Exactly two processing nodes will become saturated; furthermore, these two nodes are neighbors and are each other’s parent.

Proof. From the algorithm, it follows that an entity sends a message M only to its parent and becomes saturated only upon receiving an M message from its parent. Choose an arbitrary node x, and traverse the “up” edge of x (i.e., the edge along which the M message was sent from x to its parent). By moving along “up” edges, we must meet a saturated node s_1 since there are no cycles in the graph. This node has become saturated when receiving an M message from its parent s_2. Since s_2


PLUG-IN Full Saturation

Status: S = {AVAILABLE, ACTIVE, PROCESSING, SATURATED}; S_INIT = {AVAILABLE}.
Restrictions: R ∪ T.

AVAILABLE
    Spontaneously
    begin
        send(Activate) to N(x);
        Initialize;
        Neighbors := N(x);
        if |Neighbors| = 1 then
            Prepare Message;
            parent ⇐ Neighbors;
            send(M) to parent;
            become PROCESSING;
        else
            become ACTIVE;
        endif
    end

    Receiving(Activate)
    begin
        send(Activate) to N(x) − {sender};
        Initialize;
        Neighbors := N(x);
        if |Neighbors| = 1 then
            Prepare Message;
            parent ⇐ Neighbors;
            send(M) to parent;
            become PROCESSING;
        else
            become ACTIVE;
        endif
    end

ACTIVE
    Receiving(M)
    begin
        Process Message;
        Neighbors := Neighbors − {sender};
        if |Neighbors| = 1 then
            Prepare Message;
            parent ⇐ Neighbors;
            send(M) to parent;
            become PROCESSING;
        endif
    end

PROCESSING
    Receiving(M)
    begin
        Process Message;
        Resolve;
    end

FIGURE 2.19: Full Saturation


Procedure Initialize
begin
    nil;
end

Procedure Prepare Message
begin
    M := ("Saturation");
end

Procedure Process Message
begin
    nil;
end

Procedure Resolve
begin
    become SATURATED;
    Start Resolution stage;
end

FIGURE 2.20: Procedures used by Full Saturation

has sent an M message to s_1, this implies that s_2 must have been processing and must have considered s_1 its parent; thus, when the M message from s_1 arrives at s_2, s_2 will become saturated also. Thus, there exist at least two nodes that become saturated; furthermore, these two nodes are each other’s parent. Assume that there are more than two saturated nodes; then there exist two saturated nodes, x and y, such that d(x, y) ≥ 2. Consider a node z on the path from x to y; z could not send an M message toward both x and y; therefore, one of the nodes cannot be saturated. Therefore, the lemma holds. ∎

IMPORTANT. Which entities become saturated depends on the communication delays and is therefore totally unpredictable. Subsequent executions with the same initiators might generate different results. In fact, any pair of neighbors could become saturated. The only guarantee is that a pair of neighbors will be selected; since a pair of neighbors uniquely identifies an edge (the one connecting them), this result is also called edge election.

To determine the number of message exchanges, observe that the activation stage is a wake-up in a tree and hence it will use n + k − 2 messages (Equation 2.5), where k denotes the number of initiators. During the saturation stage, exactly one message is transmitted on each edge, except the edge connecting the two saturated nodes, on which two M messages are transmitted, for a total of n − 1 + 1 = n messages. Thus,

    M[Full Saturation] = 2n + k − 2.    (2.24)


Notice that only n of those messages are due to the saturation stage.

To determine the ideal time complexity, let I ⊆ V denote the set of initiator nodes, let L ⊆ V denote the set of leaf nodes, and let t(x) denote the time delay, from the initiation of the algorithm, until node x becomes active. To become saturated, node s must have waited until all the leaves have become active and the M messages originated from them have reached s; that is, it must have waited Max{t(l) + d(l, s) : l ∈ L}. To become active, a noninitiator node x must have waited for an “Activation” message to reach it, while there is no additional waiting time for an initiator node; thus, t(x) = Min{d(x, y) + t(y) : y ∈ I}. Therefore, the total delay, from the initiation of the algorithm, until s becomes saturated (and, thus, the ideal execution delay of the algorithm) is

    T[Full Saturation] = Max{Min{d(l, y) + t(y) : y ∈ I} + d(l, s) : l ∈ L}.    (2.25)
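The first two stages of Full Saturation, together with Lemma 2.6.1 and Equation 2.24, can be checked experimentally with a small simulation. The following is a minimal sketch, not the protocol's reference implementation: it assumes that each link delivers messages in FIFO order (so an M message never overtakes the Activate message sent before it on the same link), that all initiators start before any message is delivered, and that delays across different links are arbitrary, modelled here by a seeded random choice. The function name full_saturation and the dictionary-based tree representation are hypothetical.

```python
import random
from collections import deque

def full_saturation(adj, initiators, seed=0):
    """Simulate the activation and saturation stages on a tree.

    adj maps each node to the list of its neighbours.
    Returns (saturated nodes, number of messages sent per type).
    """
    rng = random.Random(seed)
    status = {x: "AVAILABLE" for x in adj}
    waiting = {x: set(adj[x]) for x in adj}    # the variable "Neighbors"
    parent = {}
    links = {}                                 # (src, dst) -> FIFO queue (assumed)
    sent = {"Activate": 0, "M": 0}

    def send(src, dst, kind):
        links.setdefault((src, dst), deque()).append(kind)
        sent[kind] += 1

    def maybe_send_m(x):
        if len(waiting[x]) == 1:               # one neighbour left: the parent
            parent[x] = next(iter(waiting[x]))
            send(x, parent[x], "M")
            status[x] = "PROCESSING"

    def wake(x, sender=None):
        for y in adj[x]:
            if y != sender:
                send(x, y, "Activate")
        status[x] = "ACTIVE"
        maybe_send_m(x)

    for x in initiators:                       # spontaneous starts (distinct nodes)
        wake(x)

    while True:
        busy = [link for link, q in links.items() if q]
        if not busy:
            break
        src, dst = rng.choice(busy)            # arbitrary delays between links
        kind = links[(src, dst)].popleft()
        if kind == "Activate":
            if status[dst] == "AVAILABLE":     # active nodes ignore Activate
                wake(dst, src)
        elif status[dst] == "ACTIVE":
            waiting[dst].discard(src)
            maybe_send_m(dst)
        elif status[dst] == "PROCESSING":      # M from the parent: saturation
            status[dst] = "SATURATED"
    return sorted(x for x in adj if status[x] == "SATURATED"), sent
```

Running the function with different seeds on the same tree typically selects different saturated pairs, while the two message counts stay at n + k − 2 and n.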

We will now discuss how to apply the saturation technique to solve different problems.

2.6.2 Minimum Finding

Let us see how the saturation technique can be used to compute the smallest among a set of values distributed among the nodes of the network. Every entity x has an input value v(x) and is initially in the same status; the task is to determine the minimum among those input values. That is, in the end, each entity must know whether or not its value is the smallest and enter the appropriate status, minimum or large, respectively.

IMPORTANT. Notice that these values are not necessarily distinct. So, more than one entity can have the minimum value; all of them must become minimum.

This problem is called Minimum Finding (MinFind) and is the simplest among the class of Distributed Query Processing problems that we will examine in later chapters: a set of data (e.g., a file) is distributed among the sites of a communication network; queries (i.e., external requests for information about the set) can arrive at any time at any site (which becomes an initiator of the processing), triggering computation and communication activities. A stronger version of this problem requires all entities to know the minimum value when they enter the final status.

Let us see how to solve this problem in a tree network. If the tree were rooted, this task could be trivially performed. In fact, in a rooted tree not only is there a special node, the root, but also a logical orientation of the links: “up” toward the root and “down” away from the root; this corresponds to the “parent” and “children” relationship, respectively.

In a rooted tree, to find the minimum, the root would broadcast down the request to compute the minimum value; exploiting the orientation of the links, the entities will then perform a convergecast (described in more detail in Section 2.6.7): starting from the leaves, the nodes determine the smallest value among the values “down” and send it “up.” As a result of this process, the minimum value is then determined at the root, which will then broadcast it to all nodes.


PROCESSING
    Receiving(Notification)
    begin
        send(Notification) to N(x) − {parent};
        if v(x) = Received_value then
            become MINIMUM;
        else
            become LARGE;
        endif
    end

Procedure Initialize
begin
    min := v(x);
end

Procedure Prepare Message
begin
    M := ("Saturation", min);
end

Procedure Process Message
begin
    min := MIN{min, Received_value};
end

Procedure Resolve
begin
    Notification := ("Resolution", min);
    send(Notification) to N(x) − {parent};
    if v(x) = min then
        become MINIMUM;
    else
        become LARGE;
    endif
end

FIGURE 2.21: New Rule and Procedures used for Minimum Finding

Notice that convergecast can be used only in rooted trees. The existence of a root (and the additional information existing in a rooted tree) is, however, a very strong assumption; in fact, it is equivalent to assuming the existence of a leader (which, as we will see, might not be computable).

Full Saturation allows us to achieve the same goals without a root or any additional information. This is achieved simply by including in the M message the smallest value known to the sender. Namely, in the saturation stage the leaves will send their value with the M message, and each internal node sends the smallest among its own value and all the received ones. In other words, MinF-Tree is just protocol Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.21 and where the resolution stage is just a notification, started by the two saturated nodes, of the minimum value they have computed. This is obtained by simply modifying procedure Resolve accordingly and adding the rule for handling the reception of the notification.


The correctness follows from the fact that both saturated nodes know the minimum value (Exercise 2.9.31). The number of message transmissions for the minimum-finding algorithm MinF-Tree will be exactly the same as the one experienced by Full Saturation plus the ones performed during the notification. Since a notification message is sent on every link except the one connecting the two saturated nodes, there will be exactly n − 2 such messages. Hence

    M[MinF-Tree] = 3n + k − 4.    (2.26)

The time costs will be the ones experienced by Full Saturation plus the ones required by the notification. Let Sat denote the set of the two saturated nodes; then

    T[MinF-Tree] = T[Full Saturation] + Max{d(s, x) : s ∈ Sat, x ∈ V}.    (2.27)
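The saturation and notification stages of MinF-Tree can be emulated sequentially to see where the n − 2 notification messages come from. The sketch below is an illustration under stated assumptions: all nodes are taken to be already active (the activation stage is omitted), M messages are delivered in the order in which they were sent, and the tree has at least two nodes. The function name minf_tree and the return values are hypothetical.

```python
from collections import deque

def minf_tree(adj, value):
    """Sequential emulation of the last two stages of MinF-Tree."""
    waiting = {x: set(adj[x]) for x in adj}   # the variable "Neighbors"
    best = dict(value)                        # Initialize: min := v(x)
    parent, saturated = {}, []
    queue = deque()                           # M messages in transit

    def send_m(x):
        parent[x] = next(iter(waiting[x]))
        queue.append((x, parent[x], best[x])) # Prepare Message

    for x in adj:
        if len(adj[x]) == 1:                  # leaves start the saturation stage
            send_m(x)
    while queue:
        src, dst, received = queue.popleft()
        best[dst] = min(best[dst], received)  # Process Message
        if dst in parent:                     # dst is PROCESSING: it saturates
            saturated.append(dst)
        else:
            waiting[dst].discard(src)
            if len(waiting[dst]) == 1:
                send_m(dst)

    # Resolution stage: the notification travels away from the saturated pair.
    minimum = best[saturated[0]]
    status, notifications = {}, 0
    stack = list(saturated)
    while stack:
        x = stack.pop()
        status[x] = "MINIMUM" if value[x] == minimum else "LARGE"
        for y in adj[x]:
            if y != parent[x]:
                notifications += 1
                stack.append(y)
    return status, saturated, notifications
```

On a tree with n nodes the returned notification count is n − 2, since every link except the one joining the saturated pair carries exactly one notification.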

2.6.3 Distributed Function Evaluation

An important class of problems is that of Distributed Function Evaluation; that is, problems where the task is to compute a function whose arguments are distributed among the processors of a distributed memory system (e.g., the sites of a network). An instance of this problem is the one we just solved: Minimum Finding. We will now discuss how the saturation technique can be used to evaluate a large class of functions.

Semigroup Operations

Let f be an associative and commutative function defined over all subsets of the input values. Examples of this type of function are minimum, maximum, sum, product, and so forth, as well as logical predicates. Because of their algebraic properties, these functions are called semigroup operations.

IMPORTANT. It is possible that some entities do not have an argument (i.e., initial value) or that the function must only be evaluated on a subset of the arguments. We shall denote the fact that x does not have an argument by v(x) = nil.

The same approach that has led us to solve Minimum Finding can be used to evaluate f. The protocol Function Tree is just protocol Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.22 and where the resolution stage is just a notification, started by the two saturated nodes, of the final result of the function they have computed. This is obtained by simply modifying procedure Resolve accordingly and adding the rule for handling the reception of the notification. The correctness follows from the fact that both saturated nodes know the result of the function (Exercise 2.9.32). For particular types of functions, see Exercises 2.9.33, 2.9.34, and 2.9.35.


PROCESSING
    Receiving(Notification)
    begin
        result := received_value;
        send(Notification) to N(x) − {parent};
        become DONE;
    end

Procedure Initialize
begin
    if v(x) ≠ nil then
        result := f(v(x));
    else
        result := nil;
    endif
end

Procedure Prepare Message
begin
    M := ("Saturation", result);
end

Procedure Process Message
begin
    if received_value ≠ nil then
        if result ≠ nil then
            result := f(result, received_value);
        else
            result := f(received_value);
        endif
    endif
end

Procedure Resolve
begin
    Notification := ("Resolution", result);
    send(Notification) to N(x) − {parent};
    become DONE;
end

FIGURE 2.22: New Rule and Procedures used for Function Tree
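The handling of nil in Initialize and Process Message can be illustrated in isolation. The sketch below assumes that f is given as a binary associative and commutative operation and that f applied to a single argument is that argument itself (true for minimum, maximum, sum, and product); Python's None stands for nil, and the names combine and evaluate are hypothetical.

```python
import operator

def combine(result, received, op):
    """Process Message for a binary semigroup operation; None stands for nil."""
    if received is None:        # the sender's side holds no argument
        return result
    if result is None:          # first argument seen at this entity
        return received
    return op(result, received)

def evaluate(values, op):
    """Fold the values in the given order, skipping entities without an argument."""
    result = None
    for v in values:
        result = combine(result, v, op)
    return result

print(evaluate([4, None, 7, None, 1], operator.add))   # 12
print(evaluate([4, None, 7, None, 1], min))            # 1
print(evaluate([None, None], max))                     # None
```

Because the operation is associative and commutative, the order in which the partial results reach an entity does not affect the final value.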

The time and message costs of the protocol are exactly the same as the ones for Minimum Finding. Thus, semigroup operations can be performed optimally on a tree with any number of initiators and without a root or additional information.

Cardinal Statistics

A useful class of functions are statistical ones, such as average, standard deviation, and so forth. These functions are not semigroup operations but can nevertheless be optimally computed using the saturation technique. We will just examine, as an example, the computation of Ave, the average of the (relevant) entities’ values. Observe that Ave ≡ Sum / Size, where Sum is the sum of all (relevant) values, and Size is the number of those values. Since Sum is a semigroup operation, we already know how to compute it. Also Size is trivially computed using saturation (Exercises 2.9.36 and 2.9.37).


We can collect Sum and Size at the two saturated nodes with a single execution of Saturation: the M message will contain two data fields, M = ("Saturation", sum, size), which are initialized by each leaf node and updated by the internal ones. The resolution stage is just a notification, started by the two saturated nodes, of the average they have computed. Similarly, a single execution of Full Saturation with a final notification of the result will allow the entities to compute cardinal statistics on the input values. Notice that ordinal statistics (e.g., median) are in general more difficult to resolve. We will discuss them in the chapter on selection and sorting of distributed data.

2.6.4 Finding Eccentricities

The basic technique has been so far used to solve single-valued problems; that is, problems whose solution requires the identification of a single value. It can also be used to solve multi-valued problems such as the problem of determining the eccentricities of all the nodes.

PROCESSING
    Receiving(Notification)
    begin
        result := received_value;
        send(Notification) to N(x) − {parent};
        become DONE;
    end

Procedure Initialize
begin
    sum := v(x);
    size := 1;
end

Procedure Prepare Message
begin
    M := ("Saturation", sum, size);
end

Procedure Process Message
begin
    sum := sum + Received_sum;
    size := size + Received_size;
end

Procedure Resolve
begin
    result := sum / size;
    Notification := ("Resolution", result);
    send(Notification) to N(x) − {parent};
    become DONE;
end

FIGURE 2.23: New Rule and Procedures used for computing the Average
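A short worked example may help to see why the M message carries the pair (sum, size) rather than a partial average. The sketch below uses made-up values and a hypothetical helper merge; it only illustrates the arithmetic of Process Message and Resolve.

```python
def merge(a, b):
    """Process Message: add the received (sum, size) pair to the local one."""
    return (a[0] + b[0], a[1] + b[1])

left = (10.0, 1)     # one entity holding the value 10
right = (6.0, 3)     # three entities whose values sum to 6 (e.g., 1, 2, 3)

total, size = merge(left, right)
average = total / size                                     # 16 / 4 = 4.0
naive = (left[0] / left[1] + right[0] / right[1]) / 2      # (10 + 2) / 2 = 6.0

print(average, naive)
```

Averaging the two partial averages gives 6.0 instead of the correct 4.0, because the two sides do not contain the same number of values; sending Sum and Size separately avoids this.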


The eccentricity of a node x, denoted by r(x), is the largest distance between x and any other node in the tree: r(x) = Max{d(x, y) : y ∈ V}; note that a center is a node with the smallest eccentricity. (We briefly discussed center and eccentricity already in Section 2.5.3.)

To compute its own eccentricity, a node x needs to determine the maximum distance from all other nodes in the tree. To accomplish this, x needs just to broadcast the request, making itself the root of the tree, and, using convergecast on this rooted tree, collect the maximum distance to itself. This approach would require 2(n − 1) messages and it is clearly optimal with respect to order of magnitude. If we want every entity to compute its eccentricity, this however would lead to a solution that requires 2(n^2 − n) messages. We will now show that saturation will yield instead an O(n), and thus optimal, solution.

The first step is to use saturation to compute the eccentricity of the two saturated nodes. Notice that we do not know a priori which pair of neighbors will become saturated. We can nevertheless ensure that when they become saturated they will know their eccentricity. To do so, it is enough to include, in the M message sent by an entity x to its neighbor y, the maximum distance from x to the nodes in T[x − y], increased by 1. In this way, a saturated node s will know d[s, y] for each neighbor y; thus, it can determine its eccentricity (Exercise 2.9.38).

Our goal is to have all nodes determine their eccentricity, not just the saturated ones. The interesting thing is that the information available at each entity at the end of the saturation stage is almost sufficient to make them compute their own eccentricity. Consider an entity u; it sent the M message to its parent v, after it received one from all its other neighbors; the message from y ≠ v contained d[u, y]. In other words, u knows already the maximum distance from all the entities except the ones in the tree T[v − u].

Thus, the only information u is missing is d[u, v] = Max{d(u, y) : y ∈ T[v − u]}. Notice that (Exercise 2.9.39)

    d[u, v] = Max{d(u, y) : y ∈ T[v − u]} = 1 + Max{d[v, z] : z ∈ N(v), z ≠ u}.    (2.28)

Summarizing, every node, except the saturated ones, is missing one piece of information: the maximum distance from the nodes on the other side of the link connecting it to its parent. If the parents could provide this information, the task could be completed. Unfortunately, the parents are also missing the information, unless they are the saturated nodes.

The saturated nodes have all the information they need. They also have the information their neighbors are missing: let s be a saturated node and x be an unsaturated neighbor; x is missing the information d[x, s]; by Equation 2.28, this is exactly d[x, s] = 1 + Max{d[s, z] : z ∈ N(s), z ≠ x}, and s knows all the d[s, z] (they were included in the M messages it received). So, the saturated nodes s can provide the needed information to their neighbors, who can then compute their eccentricity. The nice property is that now these neighbors have the information required by their own neighbors (further away from the saturated nodes). Thus, the resolution stage of Full


PROCESSING
    Receiving("Resolution", dist)
    begin
        Resolve;
    end

Procedure Initialize
begin
    Distance[x] := 0;
end

Procedure Prepare Message
begin
    maxdist := 1 + Max{Distance[*]};
    M := ("Saturation", maxdist);
end

Procedure Resolve
begin
    Process Message;
    Calculate Eccentricity;
    forall y ∈ N(x) − {parent} do
        maxdist := 1 + Max{Distance[z] : z ∈ N(x) − {y}};
        send("Resolution", maxdist) to y;
    endfor
    become DONE;
end

Procedure Process Message
begin
    Distance[sender] := Received_distance;
end

Procedure Calculate Eccentricity
begin
    r(x) := Max{Distance[z] : z ∈ N(x)};
end

FIGURE 2.24: New Rule and Procedures used for computing the Eccentricities

Saturation can be used to provide the missing information: starting from the saturated nodes, once an entity receives the missing information from a neighbor, it will compute its eccentricity and provide the missing information to all its other neighbors. IMPORTANT. Notice that, in the resolution stage, an entity sends different information to each of its neighbors. Thus, unlike the resolution we used so far, it is not a notiﬁcation. The protocol Eccentricities will thus be a Full Saturation where the procedures Initialize, Prepare Message, and Process Message are as shown in Figure 2.24. The rules for handling the reception of the message, the procedure Resolve, and the procedure to calculate the eccentricity are also shown in Figure 2.24. Notice that, even though each node receives a different message in the resolution stage, only one message will be received by each node in that stage, except


the saturated nodes, which will receive none. Thus, the message cost of protocol Eccentricities will be exactly as the one of MinF-Tree and so will the time cost:

    M[Eccentricities] = 3n + k − 4 ≤ 4n − 4.    (2.29)

    T[Eccentricities] = T[MinF-Tree].    (2.30)
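The two passes of protocol Eccentricities (saturation toward the selected pair, then resolution away from it) can be emulated sequentially. The following sketch assumes that all nodes are already active, that M messages are delivered in the order in which they were sent, and that the tree has at least two nodes; the function name eccentricities is hypothetical. In the resolution pass the value sent to a neighbor y follows Equation 2.28, that is, it takes the maximum over all neighbors other than y, the parent included.

```python
from collections import deque

def eccentricities(adj):
    """Return {x: r(x)} for a tree given as an adjacency dictionary."""
    dist = {x: {} for x in adj}               # dist[x][y] = d[x, y] for neighbour y
    waiting = {x: set(adj[x]) for x in adj}   # the variable "Neighbors"
    parent, saturated = {}, []
    queue = deque()                           # M messages in transit

    def send_m(x):
        parent[x] = next(iter(waiting[x]))
        # Prepare Message; default=0 plays the role of Distance[x] = 0.
        maxdist = 1 + max(dist[x].values(), default=0)
        queue.append((x, parent[x], maxdist))

    for x in adj:
        if len(adj[x]) == 1:                  # leaves start the saturation stage
            send_m(x)
    while queue:
        src, dst, received = queue.popleft()
        dist[dst][src] = received             # Process Message
        if dst in parent:                     # dst is PROCESSING: it saturates
            saturated.append(dst)
        else:
            waiting[dst].discard(src)
            if len(waiting[dst]) == 1:
                send_m(dst)

    ecc = {}
    stack = list(saturated)                   # resolution stage
    while stack:
        x = stack.pop()
        ecc[x] = max(dist[x].values())        # Calculate Eccentricity
        for y in adj[x]:
            if y != parent[x]:
                others = [dist[x][z] for z in adj[x] if z != y]
                dist[y][x] = 1 + max(others)  # the "Resolution" value for y
                stack.append(y)
    return ecc
```

Each link is used once in each pass, apart from the link between the saturated nodes, which carries two M messages and no resolution message.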

2.6.5 Center Finding

A center is a node from which the maximum distance to all other nodes is minimized. A network might have more than one center. The Center Finding problem (Center) is to make each entity aware of whether or not it is a center by entering the appropriate terminal status center or not-center, respectively.

A Simple Protocol

To solve Center we can use the fact that a center is exactly a node with the smallest eccentricity. Thus a solution protocol consists of finding the minimum among all eccentricities, combining the protocols we have developed so far:

1. Execute protocol Eccentricities;
2. Execute the last two stages (saturation and resolution) of MinF-Tree.

Part (1) will be started by the initiators; part (2) will be started by the leaves once, upon termination of their execution of Eccentricities, they know their eccentricity; the saturation stage of MinF-Tree will determine at two new saturated nodes the minimum overall eccentricity, which will be broadcast by them in the notification stage. At that time, an entity can determine if it is a center or not.

This approach will cost 3n + k − 4 messages for part (1) and n + n − 2 = 2n − 2 for part (2), for a total of 5n + k − 6 ≤ 6n − 6 messages. The time costs are no more than T[Eccentricities] + 2d ≤ 4d.

A Refined Protocol

An improvement can be derived by exploiting the structure of the problem in more detail. Recall that d[x, y] = Max{d(x, z) : z ∈ T[y − x]} is the longest distance between x and the nodes in T[y − x]. Let d_1[x] and d_2[x] be the largest and second-largest of all {d[x, y] : y ∈ N(x)}, respectively. The centers of a tree have some very interesting properties. Among them:

Lemma 2.6.2 In a tree either there is a unique center or there are two centers and they are neighbors.

Lemma 2.6.3 In a tree all centers lie on all diametral paths.

Lemma 2.6.4 A node x is a center if and only if d_1[x] − d_2[x] ≤ 1; if strict inequality holds, then x is the only center.


Lemma 2.6.5 Let y and z be neighbors of x such that d_1[x] = d[x, y] and d_2[x] = d[x, z]. If d[x, y] − d[x, z] > 1, then all centers are in T[y − x].

Lemma 2.6.4 gives us the tool we need to devise a solution protocol: an entity x can determine whether or not it is a center, provided it knows the value d[x, y] for each of its neighbors y. But this is exactly the information that was provided to x by protocol Eccentricities so it could compute r(x). This means that to solve Center it suffices to execute Eccentricities. Once an entity has all the information to compute its eccentricity, it will check whether the largest and the second largest received values differ by at most one; if so, it becomes center, otherwise not-center. Thus, the solution protocol Center Tree is obtained from Eccentricities by adding this test and some bookkeeping (Exercise 2.9.40). The time and message costs of Center Tree will be exactly the same as those of Eccentricities.

    M[Center Tree] = 3n + k − 4 ≤ 4n − 4.    (2.31)

    T[Center Tree] = T[Eccentricities].    (2.32)

An Efficient Plug-In

The solutions we have discussed are full protocols. In some circumstances, however, a plug-in is sufficient; for example, when the centers must start another global task. In these circumstances, the goal is just for the centers to know that they are centers. In such a case, we can construct a more efficient mechanism, again based on saturation, using the resolution stage in a different way.

The properties expressed by Lemmas 2.6.4 and 2.6.5 give us the tools we need to devise the plug-in. In fact, by Lemma 2.6.4, x can determine whether or not it is a center once it knows the value d[x, y] for each of its neighbors y. Furthermore, if x is not a center, by Lemma 2.6.5, this information is sufficient to determine in which subtree T[y − x] a center resides. Thus, the solution is to collect such values at a node x; determine whether x is a center; and, if not, move toward a center until it is reached.

In order to collect the information needed, we can use the first two stages (Wake-up and Saturation) of protocol Eccentricities. Once a node becomes saturated, it can determine whether it is a center by checking whether the largest and the second largest received values differ by at most one. If it is not a center, it will know that the center(s) must reside in the direction from which the largest value has been received. By keeping track at each node (during the saturation stage) of which neighbor has sent the largest value, the direction of the center can also be determined. Furthermore, a saturated node can decide whether it or its parent is closer to a center.

The saturated node, say x, closest to a center will then send a “Center” message, containing the second largest received value increased by one, in the direction of the center. A processing node receiving such a message will, in turn, be able to determine whether it is a center and, if not, the direction toward the center(s).


Once the message arrives at a center c, c will be able to determine whether or not it is the only center (using Lemma 2.6.4); if it is not, it will know which neighbor is the other center and will notify it. The Center Finding plug-in is thus the Full Saturation plug-in with the addition of the “Center” message traveling from the saturated nodes to the centers. In particular, the routines Initialize, Process Message, Prepare Message, and Resolve, together with the new rule governing the reception of the “Center” messages, are shown in Figure 2.25.

PROCESSING
   Receiving("Center", value)
   begin
      Process Message;
      Resolve;
   end

Procedure Initialize
begin
   Max Value := 0;
   Max2 Value := 0;
end

Procedure Prepare Message
begin
   M := ("Saturation", Max Value + 1);
end

Procedure Process Message
begin
   if Max Value < Received Value then
      Max2 Value := Max Value;
      Max Value := Received Value;
      Max Neighbor := sender;
   else
      if Max2 Value < Received Value then
         Max2 Value := Received Value;
      endif
   endif
end

Procedure Resolve
begin
   if Max Value - Max2 Value = 1 then
      if Max Neighbor ≠ parent then
         send("Center", Max2 Value + 1) to Max Neighbor;
      endif
      become CENTER;
   else
      if Max Value - Max2 Value > 1 then
         send("Center", Max2 Value + 1) to Max Neighbor;
      else
         become CENTER;
      endif
   endif
end

FIGURE 2.25: Transforming Saturation into an efﬁcient Plug-In for Center Finding
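As a reading aid, the node-local rules of the figure can be transcribed into Python. This is a sketch under stated assumptions, not the author's code: the class name `CenterNode` and its method names are hypothetical, `parent` stands for the neighbor to which the node sent its Saturation message, the value carried by a "Center" message follows the prose description (second largest received value increased by one), and message transport as well as the Wakeup stage are left out.

```python
class CenterNode:
    """Local state and rules of one entity in the Center Finding plug-in (sketch)."""

    def __init__(self, parent=None):
        self.max_value = 0        # largest value received so far
        self.max2_value = 0       # second largest value received so far
        self.max_neighbor = None  # neighbor that sent the largest value
        self.parent = parent      # assumed: neighbor that got our Saturation message
        self.status = "PROCESSING"

    def prepare_message(self):
        return ("Saturation", self.max_value + 1)

    def process_message(self, sender, value):
        if self.max_value < value:
            self.max2_value = self.max_value
            self.max_value = value
            self.max_neighbor = sender
        elif self.max2_value < value:
            self.max2_value = value

    def resolve(self):
        """Return (destination, message) if a Center message must be sent, else None."""
        gap = self.max_value - self.max2_value
        center_msg = (self.max_neighbor, ("Center", self.max2_value + 1))
        if gap > 1:
            return center_msg          # not a center: forward toward the centers
        self.status = "CENTER"
        if gap == 1 and self.max_neighbor != self.parent:
            return center_msg          # notify the other center
        return None


# Path a - b - c - d - e, with b already holding the values sent by a and c.
b = CenterNode(parent="a")
b.process_message("a", 1)
b.process_message("c", 3)
print(b.resolve())            # ('c', ('Center', 2))

c = CenterNode(parent="b")
c.process_message("d", 2)     # received during the saturation stage
c.process_message("b", 2)     # value carried by the Center message from b
print(c.resolve(), c.status)  # None CENTER
```

In this run b learns that the center lies toward c, and c recognizes itself as the unique center once the value from b arrives.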


The message cost of this plug-in is easily determined by observing that, after the Full Saturation plug-in is applied, a message will travel from the saturated node s (closest to a center) to its furthermost center c; hence, d(s, c) additional messages are exchanged. Since d(s, c) ≤ n/2, the total number of message exchanges performed is

M[Center-Finding] = 2.5n + k − 2 ≤ 3.5n − 2. (2.33)

2.6.6 Other Computations

The simple modifications to the basic technique that we have discussed in the previous sections can be applied to solve a variety of other problems efficiently. Following is a sample of them and the key properties employed toward their solution.

Finding a Median
A median is a node from which the average distance to all nodes in the network is minimized. Since a median obviously minimizes the sum of the distances to all other nodes, it is also called a communication center of the network. In a tree, the key properties are:

Lemma 2.6.6 In a tree either there is a unique median or there are two medians and they are neighbors.

Given a node x and a subtree T, let g[T, x] = Σ_{y∈T} d(x, y) denote the sum of all distances between x and the nodes in T, and let G[x, y] = g[T, x] − g[T, y] = n + 2 − 2 ∗ |T[y − x]|; then

Lemma 2.6.7 Entity x is a median if and only if G[x, y] ≥ 0 for all neighbors y.
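A brute-force check of the median definition and of Lemma 2.6.6 on small trees can be sketched as follows. It is a centralized illustration under assumptions: the adjacency dictionary `tree` and the function names `distance_sum` and `medians` are made up for this example, and the sums of distances are computed directly by breadth-first search instead of through the quantities G[x, y].

```python
from collections import deque


def distance_sum(tree, x):
    """Sum of the distances from x to all nodes of the tree."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(dist.values())


def medians(tree):
    """All nodes that minimize the sum of distances."""
    sums = {x: distance_sum(tree, x) for x in tree}
    best = min(sums.values())
    return sorted(x for x in tree if sums[x] == best)


bushy = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2, 5], 5: [4]}
even_path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(medians(bushy))      # [2]
print(medians(even_path))  # [2, 3]
```

The first tree has a unique median; the second has two medians, and they are neighbors, as Lemma 2.6.6 states.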

Furthermore,

Lemma 2.6.8 If x is not the median, there exists a unique neighbor y such that G[y, x] < 0; such a neighbor lies on the path from x to the median.

Using these properties, it is simple to construct a full protocol as well as an efficient plug-in, following the same approaches used for center finding (Exercise 2.9.41).

Finding Diametral Paths
A diametral path is a path of the longest length. In a network there might be more than one diametral path. The problem we are interested in is to identify all these paths. In distributed terms, this means that each entity needs to know if it is part of a diametral path or not, entering an appropriate status (e.g., on-path or off-path). The key property to solve this problem is

Lemma 2.6.9 A node x is on a diametral path if and only if d1[x] + d2[x] = d.
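The test of Lemma 2.6.9 can be tried out on a small tree with the following sketch. It is a centralized illustration, with assumed names (`depth`, `two_largest`, `on_diametral_path`) and an assumed adjacency dictionary `tree`; the diameter d is obtained here as the largest value of d1[x] + d2[x] over all nodes, and d2[x] is taken to be 0 for a leaf.

```python
def depth(tree, node, prev):
    """Largest number of links from prev to a node reached through node."""
    return 1 + max((depth(tree, nxt, node) for nxt in tree[node] if nxt != prev),
                   default=0)


def two_largest(tree, x):
    """The pair (d1[x], d2[x]); missing branches count as 0."""
    values = sorted((depth(tree, y, x) for y in tree[x]), reverse=True) + [0, 0]
    return values[0], values[1]


def on_diametral_path(tree):
    """Return the diameter d and the nodes x with d1[x] + d2[x] = d."""
    sums = {x: sum(two_largest(tree, x)) for x in tree}
    d = max(sums.values())
    return d, sorted(x for x in tree if sums[x] == d)


# Path 1 - 2 - 3 - 4 - 5 with an extra leaf 6 attached to node 3.
tree = {1: [2], 2: [1, 3], 3: [2, 4, 6], 4: [3, 5], 5: [4], 6: [3]}
print(on_diametral_path(tree))  # (4, [1, 2, 3, 4, 5])
```

Node 6 has d1 = 3 and d2 = 0, so it is correctly classified as off-path.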


Thus, a solution strategy will be to determine d, d1[x], and d2[x] at every x and then use Lemma 2.6.9 to decide the final status. A full protocol efficiently implementing this strategy can be designed using the tools developed so far (Exercise 2.9.45).

Consider now designing a plug-in instead of a full protocol; that is, we are interested only in having the entities on diametral paths (and only those) become aware of it. In this case, the other key property is Lemma 2.6.4: every center lies on every diametral path. This gives us a starting point to find the diametral paths: the centers. To continue, we can then use Lemma 2.6.9. In other words, we first find the centers (note: they know the diameter) and then propagate the information along the diametral paths. A center (or, for that matter, a node on a diametral path) does not know a priori which of its neighbors is also on a diametral path. It will thus send the needed information to all its neighbors which, upon receiving it, will determine whether or not they are on such a path; if so, they continue the execution (Exercise 2.9.46).

2.6.7 Computing in Rooted Trees

Rooted Trees
In some cases, the tree T is actually rooted; that is, there is a distinct node, r, called the root, and all links are oriented toward r. In this case, the tree T will be denoted by T[r]. If link (x, y) is oriented from y to x, x is called the parent of y and y is said to be a child of x. Similarly, a descendant of x is any entity z for which there is a directed path from z to x, and an ancestor of x is any entity z for which there is a directed path from x to z. Two important properties of a rooted tree are that the root has no parent, while every other node has only one parent (see Fig. 2.26).

Before examining how to compute in rooted trees, let us first observe the important fact that transforming a tree into a rooted one might be an impossible task.


FIGURE 2.26: (a) A tree T; (b) the same tree rooted in s: T[s] .


[Figure: two entities, x and y, joined by a single link whose port is labeled 1 at both ends.]

FIGURE 2.27: It is impossible to transform this tree into a rooted one.

Theorem 2.6.1 The problem of transforming trees into rooted ones is deterministically unsolvable under R.

Proof. Recall that deterministically unsolvable means that there is no deterministic protocol that always correctly terminates within finite time. To see why this is true, consider the simple tree composed of two entities x and y connected by links labeled as shown in Figure 2.27. Let the two entities have identical initial values (the symbols x, y are used only for description purposes). If a solution protocol A exists, it must work under any conditions of message delays (as long as they are finite) and regardless of the number of initiators.

Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let both entities start the execution of A simultaneously. Since they are identical (same initial status and values, same port labels), they will execute the same rule, obtain the same results (thus, continuing to have the same local values), compose and send (if any) the same messages, and enter the same (possibly new) status. In other words, they will remain identical. In the next time unit, all sent messages (if any) will arrive and be processed. If one entity receives a message, the other will receive the same message at the same time, perform the same local computation, compose and send (if any) the same messages, and enter the same (possibly new) status. And so on. In other words, the two entities will continue to be identical.

If A is a solution protocol, it must terminate within finite time; when this occurs, one entity, say x, becomes the root. But since both entities will always have the same state in this execution, y will also become root, contradicting the fact that A is correct. Thus, no such solution algorithm A exists. ∎

This means that being in a rooted tree is considerably different from being in a tree. Let us see how to exploit this difference.
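The symmetry argument in the proof of Theorem 2.6.1 can be replayed in code. The sketch below is an illustration under assumptions, not part of the original proof: `run_symmetric` and `toy_step` are hypothetical names, a protocol is modeled as a deterministic function from (state, received messages) to (new state, sent messages), and delays are unitary. Whatever deterministic rule is plugged in, the two entities stay in identical states.

```python
def run_symmetric(step, initial_state, rounds):
    """Run the same deterministic rule at both ends of a single link.

    step(state, inbox) must return (new_state, outbox).
    """
    state_x = state_y = initial_state
    inbox_x = inbox_y = ()
    for _ in range(rounds):
        state_x, out_x = step(state_x, inbox_x)
        state_y, out_y = step(state_y, inbox_y)
        # Unit delays: what one entity sends now, the other receives next round.
        inbox_x, inbox_y = out_y, out_x
        assert state_x == state_y and out_x == out_y  # symmetry is never broken
    return state_x, state_y


def toy_step(state, inbox):
    """A made-up rule that tries to elect a root by counting."""
    status, counter = state
    counter += 1 + sum(inbox)
    if counter >= 5:
        status = "root"
    return (status, counter), (counter,)


print(run_symmetric(toy_step, ("idle", 0), 3))
# (('root', 7), ('root', 7))
```

Both entities declare themselves root in the same round, which is exactly the contradiction used in the proof.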
Convergecast
The orientation of the links in a rooted tree is such that each entity has a notion of “up” (i.e., towards the root) and “down” (i.e., away from the root). If we are in a rooted tree, we can obviously exploit the availability of this globally consistent orientation. In particular, in the saturation technique, the process performed in the saturation stage can be simplified as follows:

Convergecast
1. a leaf sends its message to its parent;
2. each internal node waits until it receives a message from all its children; it then sends a message to its parent.

In this way, the root (which does not have a parent) will be the sole saturated node and will start the resolution stage.
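The two convergecast rules above can be simulated with a short sketch. The function name `convergecast`, the `parent` map (with `None` marking the root), and the use of minimum finding as the computation carried by the messages are assumptions made for this example; the point is only to show that every node except the root sends exactly one message.

```python
def convergecast(parent, value, combine=min):
    """Simulate convergecast on a rooted tree given as a child -> parent map.

    Returns the result known at the root and the number of messages sent.
    """
    children = {x: [] for x in parent}
    for x, p in parent.items():
        if p is not None:
            children[p].append(x)
    pending = {x: len(children[x]) for x in parent}  # children not yet heard from
    partial = dict(value)                            # best value known at each node
    ready = [x for x in parent if pending[x] == 0]   # initially: the leaves
    messages = 0
    root_result = None
    while ready:
        x = ready.pop()
        p = parent[x]
        if p is None:            # the root is the sole saturated node
            root_result = partial[x]
            continue
        messages += 1            # x sends its message "up"
        partial[p] = combine(partial[p], partial[x])
        pending[p] -= 1
        if pending[p] == 0:      # p has heard from all its children
            ready.append(p)
    return root_result, messages


parent = {"r": None, "a": "r", "b": "r", "c": "a", "d": "a"}
value = {"r": 7, "a": 4, "b": 9, "c": 2, "d": 5}
print(convergecast(parent, value))  # (2, 4)
```

With five nodes, four messages are sent, one per link of the tree.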


This simplified process is called convergecast. If we are in a rooted tree, we can solve all the problems we discussed in the previous section (minimum finding, center finding, etc.) using convergecast in the saturation stage. In spite of its greater simplicity, the saving in cost due to convergecast is only 1 message (Exercise 2.9.47). Clearly, such an amount alone does not justify the difference between general trees and rooted ones. There are, however, other advantages in rooted trees, as we will see later.

Totally Ordered Trees
In addition to the globally consistent orientation “up and down,” a rooted tree has another powerful property. In fact, the port numbers at a node are distinct; thus, they can be sorted, for example, in increasing order, and the corresponding links can be ordered accordingly. This means that the entire tree is ordered. As a consequence, the nodes can also be totally ordered, for example, according to a preorder traversal (see Fig. 2.28). Note that a node might not be aware of its order number in the tree, although this information can be easily acquired in the entire tree (Exercise 2.9.49). This means that, in a rooted tree, the root can assign unique ids to the entities. This fact shows indeed the power of rooted trees. The fact that a rooted tree is totally ordered can be exploited also in other computations. Following are two examples.

Example: Choosing a Random Entity. In many systems and applications, it is necessary to occasionally select an entity at random. This occurs for instance in routing systems where, to reduce congestion, a message is first sent to an intermediate destination chosen at random and then delivered from there to the final destination. The same random selection is made, for example, for coordination of a computation, for control of a resource, etc. The problem is how to determine an entity at random. Let us concentrate on uniform choice; that is, every entity must have the same probability, 1/n, of being selected.


FIGURE 2.28: A rooted tree is an ordered tree and unique names can be given to the nodes.
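The naming by preorder traversal can be sketched as follows. This is a centralized illustration with assumed names: `preorder_names` is hypothetical, and the dictionary `children` is assumed to list the children of each node already sorted by increasing port number.

```python
def preorder_names(children, root):
    """Assign the names 1, 2, ..., n to the nodes in preorder.

    children[x] lists the children of x in increasing order of port number.
    """
    names = {}
    stack = [root]
    while stack:
        node = stack.pop()
        names[node] = len(names) + 1
        # Push in reverse so that the child on the smallest port is visited first.
        stack.extend(reversed(children.get(node, [])))
    return names


children = {"r": ["a", "b"], "a": ["c", "d"]}
print(preorder_names(children, "r"))
# {'r': 1, 'a': 2, 'c': 3, 'd': 4, 'b': 5}
```

Each node receives a distinct name, and the whole subtree of a node is numbered before its next sibling.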


In a rooted tree, it becomes easy for the root to select uniformly an entity at random. Once unique names have been assigned in preorder to the nodes and the root knows the number n of entities, the root needs only to choose locally a number uniformly at random between 1 and n; the entity with such a name will be the selected one. At this point, the only thing that the root r still has to do is to communicate efficiently to the selected entity x the result of the selection. Actually, it is not necessary to assign unique names to the entities; in fact, it suffices that each entity knows the number of descendants of each of its children, and the entire process (from initial notification to all to final notification to x) can be performed with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units (Exercise 2.9.50).

Example: Choosing at Random from a Distributed Set. An interesting computation is the one of choosing at random an element of a set of data distributed (without replication) among the entities. The setting is that of a set D partitioned among the entities; that is, each entity x has a subset Dx ⊆ D of the data where ∪x Dx = D and, for x ≠ y, Dx ∩ Dy = ∅. Let us concentrate again on uniform choice; that is, every data item must have the same probability, 1/|D|, of being selected. How can this be achieved?

IMPORTANT. Choosing first an entity uniformly at random and then choosing an item uniformly at random in the set stored there will NOT give a uniformly random choice from the entire set (Exercise 2.9.51).

Interestingly, this problem can be solved with a technique similar to that used for selecting an entity at random and with the same cost (Exercise 2.9.52).

Application: Broadcast with Termination Detection
Convergecast can be used whenever there is a rooted spanning tree. We will now see an application of this fact.
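One possible way to route the random choice down the tree using only per-subtree totals is sketched below. It is not the solution to the exercises, only an illustration under assumptions: the names `subtree_totals`, `locate`, and `choose` are hypothetical, the totals are computed centrally instead of by convergecast, and `weight[x]` is 1 when selecting an entity or |Dx| when selecting a data item.

```python
import random


def subtree_totals(children, weight, root):
    """Total weight stored in the subtree of every node."""
    order, stack = [], [root]
    while stack:                      # parents are listed before their children
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    totals = {}
    for node in reversed(order):      # children are processed before parents
        totals[node] = weight[node] + sum(totals[c] for c in children.get(node, []))
    return totals


def locate(children, weight, totals, root, k):
    """Walk down from the root to the node holding the k-th unit (k starts at 0)."""
    node, hops = root, 0
    while k >= weight[node]:
        k -= weight[node]
        for child in children.get(node, []):
            if k < totals[child]:
                node, hops = child, hops + 1
                break
            k -= totals[child]
        else:
            raise ValueError("k is outside the range 0 .. total - 1")
    return node, hops


def choose(children, weight, root, rng=random):
    """Pick one unit uniformly at random; return its holder and the hops used."""
    totals = subtree_totals(children, weight, root)
    return locate(children, weight, totals, root, rng.randrange(totals[root]))


children = {"r": ["a", "b"], "a": ["c"]}
entities = {"r": 1, "a": 1, "b": 1, "c": 1}   # one unit per entity
items = {"r": 0, "a": 3, "b": 1, "c": 0}      # |Dx| items per entity
totals = subtree_totals(children, entities, "r")
print([locate(children, entities, totals, "r", k)[0] for k in range(4)])
# ['r', 'a', 'c', 'b']
```

With the weights in `items`, three of the four equally likely values of k lead to a and one leads to b, so each data item is chosen with probability 1/4.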
It is a “fact of life” in distributed computing that entities can terminate the execution of a protocol at different times; furthermore, when an entity terminates, it is usually unaware of the status of the other entities. This is why we differentiate between local termination (i.e., of the entity) and global termination (i.e., of the entire system). For example, with the broadcast protocol Flooding the initiator of the broadcast does not know when the broadcast is over. To ensure that the initiator of the broadcast becomes aware of when global termination occurs, we need to use a different strategy. To develop this strategy, recall that, if an entity s performs a Flood+Reply (e.g., protocol Shout) in a tree, the tree will become rooted in s: the initiator is the root; for every other node y, the neighbor x from which it receives the ﬁrst broadcasted message is its parent, and all the neighbors that send the positive reply (e.g., “YES” in Shout and Shout+) are its children. This means that convergecast can be “appended” to any Flood+Reply protocol.


Strategy Broadcast with Termination Detection:
1. The initiator s uses any Flood+Reply protocol to broadcast and construct a spanning tree T[s] of the network;
2. Starting from the leaves of T[s], the entities perform a convergecast on T.

At the end of the convergecast, s becomes aware of the global termination of the broadcast (Exercise 2.9.48). As for the cost, to broadcast with termination detection we need just to add the cost of the convergecast to the one of the Flood+Reply protocol used. For example, if we use Shout+, the resulting protocol, which we shall call TDCast, will then use 2m + n − 1 messages. The ideal time of Shout+ is exactly r(s) + 1; the ideal time of convergecast is exactly the height of the tree T[s], that is, r(s); thus, protocol TDCast has ideal time complexity 2r(s) + 1. This means that termination detection can be added to broadcast with less than twice the cost of broadcasting alone.
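The cost figures quoted for TDCast can be evaluated on a concrete network with a small sketch. The function name `tdcast_cost` and the adjacency dictionary `graph` are assumptions made for this example; the code only evaluates the formulas 2m + n − 1 and 2r(s) + 1 and does not simulate the protocol.

```python
from collections import deque


def tdcast_cost(graph, s):
    """Evaluate the TDCast cost formulas for initiator s in a connected graph."""
    n = len(graph)
    m = sum(len(neighbors) for neighbors in graph.values()) // 2
    dist = {s: 0}                 # breadth-first search gives the distances from s
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    r_s = max(dist.values())      # eccentricity r(s) of the initiator
    return {"messages": 2 * m + n - 1, "ideal_time": 2 * r_s + 1}


ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(tdcast_cost(ring, 0))  # {'messages': 11, 'ideal_time': 5}
```

For the ring with four nodes, Shout+ alone accounts for 2m = 8 messages and the convergecast adds n − 1 = 3.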

2.7 SUMMARY

2.7.1 Summary of Problems

Broadcast [Information problem] ⇒ A single entity has special information that everybody must know.
   Unique Initiator
   Flooding: Messages = Θ(m); Time = Θ(d)

Wake-Up [Information/Synchronization problem] ⇒ Some entities are awake; everybody must wake up.
   Wake-Up ≡ (Broadcast with multiple initiators)
   WFlood: Messages = Θ(m); Time = Θ(d)

Traversal [Network problem] ⇒ Starting from the initiator, each entity is visited sequentially.
   Unique Initiator
   DF-Traversal: Messages = Θ(m); Time = Θ(n)

Spanning-Tree Construction [Network problem] ⇒ Each entity identifies the subset of neighbors in the spanning tree.
   SPT with unique initiator ≡ Broadcast
   Unique Initiator: Shout: Messages = Θ(m); Time = Θ(d)
   Multiple Initiators: assume Distinct Initial Values


Election [Control problem] ⇒ One entity becomes leader, all others enter different special status.
   Distinct Initial Values

Minimum Finding [Data problem] ⇒ Each entity must know whether its initial value is minimum or not.

Center Finding [Network problem] ⇒ Each entity must know whether or not it is a center of the network.

2.7.2 Summary of Techniques

Flooding: with single initiator = broadcast; with multiple initiators = wake-up.
Flooding with Reply (Shout): with single initiator, it creates a spanning tree rooted in the initiator.
Convergecast: in rooted trees only.
Flooding with Replies plus Convergecast (TDCast): single initiator only; initiator finds out that the broadcast has globally terminated.
Saturation: in trees only.
Depth-first traversal: single initiator only.

2.8 BIBLIOGRAPHICAL NOTES

Of the basic techniques, flooding is the oldest one, still currently and frequently used. The more sophisticated refinements of adding reply and a convergecast were discussed and employed independently by Adrian Segall [11] and Ephraim Korach, Doron Rotem, and Nicola Santoro [8]. Broadcasting in a linear number of messages in unoriented hypercubes is due to Stefan Dobrev and Peter Ruzicka [6]. The use of broadcast trees was first discussed by David Wall [12].

The depth-first traversal protocol was first described by Ernie Chang [3]; the first hacking improvement is due to Baruch Awerbuch [2]; the subsequent improvements were obtained by Kadathur Lakshmanan, N. Meenakshi, and Krishnaiyan Thulasiraman [9] and independently by Israel Cidon [4].

The difficulty of performing a wake-up in labeled hypercubes and in complete graphs has been proved by Stefan Dobrev, Rastislav Kralovic, and Nicola Santoro [5]. The first formal argument on the impossibility of some global computations under R (e.g., the impossibility result for spanning-tree construction with multiple initiators) is due to Dana Angluin [1].

The saturation technique is originally due to Nicola Santoro [10]; its application to center and median finding was developed by Ephraim Korach, Doron Rotem, and Nicola Santoro [8]. A decentralized solution to the ranking problem (Problem 2.9.4) was designed by Ephraim Korach, Doron Rotem, and Nicola Santoro [7]; a less efficient centralized one is due to Shmuel Zaks [13].


2.9 EXERCISES, PROBLEMS, AND ANSWERS

2.9.1 Exercises

Exercise 2.9.1 Show that protocol Flooding uses exactly 2m − n + 1 messages.

Exercise 2.9.2 Design a protocol to broadcast without the restriction that the unique initiator must be the entity with the initial information. Write the new problem definition. Discuss the correctness of your protocol. Analyze its efficiency.

Exercise 2.9.3 Modify Flooding so as to broadcast under the restriction that the unique initiator must be an entity without the initial information. Write the new problem definition. Discuss the correctness of your protocol. Analyze its efficiency.

Exercise 2.9.4 We want to move the system from an initial configuration where every entity is in the same status ignorant except the one that is knowledgeable to a final configuration where every entity is in the same status. Consider this problem under the standard assumptions plus Unique Initiator. (a) Prove that, if the unique initiator is restricted to be one of the ignorant entities, this problem is the same as broadcasting (same solution, same costs). (b) Show how, if the unique initiator is restricted to be the knowledgeable entity, the problem can be solved without any communication.

Exercise 2.9.5 Design a protocol to broadcast without the Bidirectional Link restriction. Discuss its correctness. Analyze its efficiency.

Exercise 2.9.6 Prove that, in the worst case, the number of messages used by protocol WFlood is at most 2m. Show under what conditions such a bound will be achieved. Under what conditions will the protocol use only 2m − n + 1 messages?

Exercise 2.9.7 Prove that protocol WFlood correctly terminates under the standard set of restrictions BL, C, and TR.

Exercise 2.9.8 Write the protocol that implements strategy HyperFlood.

Exercise 2.9.9 Show that the subgraph Hk(x), induced by the messages sent when using HyperFlood on the k-dimensional hypercube Hk with x as the initiator, contains no cycles.

Exercise 2.9.10 Show that for every x the eccentricity of x in Hk(x) is k.

Exercise 2.9.11 Prove that the message complexity of traversal under R is at least m. (Hint: use the same technique employed in the proof of Theorem 2.1.1.)


Exercise 2.9.12 Let G be a tree. Show that, in this case, no Backedge messages will be sent in any execution of DF Traversal.

Exercise 2.9.13 Characterize the virtual ring formed by an execution of DF Traversal in a tree network. Show that the ring has 2n − 2 virtual nodes.

Exercise 2.9.14 Write the protocol DF++.

Exercise 2.9.15 Prove that protocol DF++ correctly performs a depth-first traversal.

Exercise 2.9.16 Show that, in the execution of DF++, on some back-edges there might be two “mistakes.”

Exercise 2.9.17 Determine the exact number of messages transmitted in the worst case when executing DF* in a complete graph.

Exercise 2.9.18 Prove that in protocol Shout, if an entity x is in Tree-neighbors of y, then y is in Tree-neighbors of x.

Exercise 2.9.19 Prove that in protocol Shout, if an entity sends Yes, then it is connected to the initiator by a path where on every link a Yes has been transmitted. (Hint: use induction.)

Exercise 2.9.20 Prove that the subnet constructed by protocol Shout contains no cycles.

Exercise 2.9.21 Prove that T[Flood+Reply] = T[Flooding] + 1.

Exercise 2.9.22 Write the set of rules for protocol Shout+.

Exercise 2.9.23 Determine under what conditions on the communication delays protocol Shout will construct a breadth-first spanning tree.

Exercise 2.9.24 Modify protocol Shout so that the initiator can determine when the broadcast is globally terminated. (Hint: integrate in the protocol the convergecast operation for rooted trees.)

Exercise 2.9.25 Modify protocol DF* so that every entity determines its neighbors in the df-tree it constructs.

Exercise 2.9.26 Prove that f∗ is exactly the number of leaves of the df-tree constructed by df-SPT.

Exercise 2.9.27 Prove that, in the execution of df-SPT, when the initiator becomes done, a df-tree of the network has already been constructed.


Exercise 2.9.28 Prove that, for any broadcast protocol, the graph induced by relationship “parent” is a spanning tree of the network.

Exercise 2.9.29 Prove that the bf-tree of G rooted in a center is a broadcast tree of G.

Exercise 2.9.30 Verify that, with multiple initiators, the optimized versions DF+ and DF* of protocol df-SPT will always create a spanning forest of the graph depicted in Figure 2.14.

Exercise 2.9.31 Prove that when a node becomes saturated in the execution of protocol MinF-Tree, it knows the minimum value in the network.

Exercise 2.9.32 Prove that when a node becomes saturated in the execution of protocol Funct-Tree, it knows the value of f.

Exercise 2.9.33 Design a protocol to determine if all the entities of a tree network have positive initial values. Any number of entities can independently start.

Exercise 2.9.34 Consider a tree system where each entity has a salary and a gender. Some external investigators want to know if all the entities with a salary below $50,000 are female. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.35 Consider the same tree system of Exercise 2.9.34. The investigators now want to know if there is at least one female with a salary above $50,000. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.36 Design an efficient protocol to compute the number of entities in a tree network. Any number of entities can independently start the protocol.

Exercise 2.9.37 Consider the same tree system of Exercise 2.9.34. The investigators now want to know how many female entities are in the system. Design a solution protocol that can be started by any number of entities independently.

Exercise 2.9.38 Consider the following use of the M message: a leaf will include a value v = 1; an internal node will include one plus the maximum of all the received values. Prove that the saturated nodes will compute their maximum distance from all other nodes.

Exercise 2.9.39 Prove that for any link (u, v), d[u, v] = Max{d(u, y) : y ∈ T[v − u]} = 1 + Max{d(v, y) : y ∈ T[u − v]} = Max{d[v, z] : z ≠ u ∈ N(v)}.

Exercise 2.9.40 Modify protocol Eccentricities so it can solve Center, as discussed in Section 2.6.5.


Exercise 2.9.41 Median Finding. Construct an efficient plug-in so that the median nodes know that they are such.

Exercise 2.9.42 Diameter Finding. Design an efficient protocol to determine the diameter of the tree. (Hint: use Lemma 2.6.2.)

Exercise 2.9.43 Rank Finding in Tree. Consider a tree where each entity x has an initial value v(x); these values are not necessarily distinct. The rank of an entity x will be the rank of its value; that is, rank(x) = 1 + |{y ∈ V : v(y) < v(x)}|. So, whoever has the smallest value has rank 1. Design an efficient protocol to determine the rank of a unique initiator (i.e., under the additional restriction UI).

Exercise 2.9.44 Generic Rank Finding. Consider the ranking problem described in Exercise 2.9.43. Design an efficient solution protocol that is generic; that is, it works in an arbitrary connected graph.

Exercise 2.9.45 Diametral Paths. A path whose length is d is called diametral. Design an efficient protocol so that each entity can determine whether or not it lies on a diametral path of the tree.

Exercise 2.9.46 A path whose length is d is called diametral. Design an efficient plug-in so that all and only the entities on a diametral path of the tree become aware of this fact.

Exercise 2.9.47 Show that convergecast uses only 1 (one) message less than the saturation stage in general trees.

Exercise 2.9.48 Prove that, when an initiator of a TDCast protocol receives the convergecast message from all its children, the initial broadcast is globally terminated.

Exercise 2.9.49 Show how to assign efficiently a unique id to the entities in a rooted tree.

Exercise 2.9.50 Random Entity Selection () Consider the task of selecting uniformly at random an entity in a tree rooted at s. Show how to perform this task, started by the root, with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units. Prove both correctness and complexity.
Exercise 2.9.51 Show why choosing uniformly at random a site and then choosing uniformly at random an element from that site is not the same as choosing uniformly at random an element from the entire set.

Exercise 2.9.52 Random Item Selection () Consider the task of selecting uniformly at random an item from a set of data partitioned among the nodes of a tree rooted at s. Show how to perform this task, started by the root, with at most 2(n − 1) + dT(s, x) messages and 2r(s) + dT(s, x) ideal time units. Prove both correctness and complexity.

2.9.2 Problems

Problem 2.9.1 Develop an efficient solution to the Traversal problem without the Bidirectional Links assumption.

Problem 2.9.2 Develop an efficient solution to the Minimum Finding problem in a hypercube with a unique initiator (i.e., under the additional restriction UI). Note that the values might not be distinct.

Problem 2.9.3 Solve the Minimum Finding problem in a system where there is already a leader; that is, under restrictions R ∪ UI. Note that the values might not be distinct. Prove the correctness of your solution, and analyze its efficiency.

Problem 2.9.4 Ranking. () Consider a tree where each entity x has an initial value v(x); these values are not necessarily distinct. The rank of an entity x will be the rank of its value; that is, rank(x) = 1 + |{y ∈ V : v(y) < v(x)}|. So, whoever has the smallest value has rank 1. Design an efficient protocol to determine the rank of all entities. Prove the correctness of your protocol and analyze its complexity.

2.9.3 Answers to Exercises

Answer to Exercise 2.9.13 A node appears several times in the virtual ring; more precisely, there is an instance of node z in R for each time z has received a Token or a Finished message. Let x be the initiator; node x sends a Token to each of its neighbors sequentially and receives a Finished message from each. Every node y ≠ x receives exactly one Token (from its parent) and sends one to all its other neighbors (its children); it will also receive a Finished message from all its children and send one to its parent. In other words, every node z, including the initiator x, will appear n(z) = |N(z)| times in the virtual ring. The total number of (virtual) nodes in the virtual ring is therefore Σ_{z∈V} |N(z)| = 2m = 2(n − 1).

Answer to Exercise 2.9.16 Consider a ring network with the three nodes x, y, and z. Assume that entity x holds the Token initially.
Consider the following sequence of events that take place successively in time as a result of the execution of the DF++ protocol: x sends Visited messages to y and z, sends the Token to y, and waits for a (Visited or Return) reply from y. Assume that the link (x, z) is extremely slow. When y receives the Token from x, it sends to z a Visited message and then the Token. Assume that when z receives the Token, the Visited message from x has not arrived yet; hence z sends Visited to x followed by the Token. This is the ﬁrst mistake: Token is sent on a back-edge to x, which has already been visited.


When z finally receives the Visited message from x, it realizes the Token it sent to x was a mistake. Since it has no other unvisited neighbors, z sends a Return message back to y. Since y has no other unvisited neighbors, it will then send a Return message back to x. Assume that when x receives the Return message from y, x has not yet received either the Visited or the Return messages sent by z. Hence, x considers z as an unvisited neighbor and sends the Token to z. This is the second mistake on the back-edge between x and z.

Answer to Exercise 2.9.19 Suppose some node x is not reachable from s in the graph T induced by the “parent” relationship. This means that x never sent the Yes messages; this implies that x never received the question Q. This is impossible because, since flooding is correct, every entity will receive Q; thus, no such x exists.

Answer to Exercise 2.9.20 Suppose the graph T induced by the “parent” relationship (i.e., the Yes messages) contains a directed cycle x0, x1, . . . , xk−1; that is, xi is the parent of xi+1 (operations on the indices are modulo k). This cycle cannot contain the initiator s (because it does not send any Yes). We know (Exercise 2.9.19) that in T there is a path from s to each node, including those in the cycle. This means that there will be in T a node y not in the cycle that is connected to a node xi in the cycle. This means that xi sent a Yes message to y; but since it is in the cycle, it also sent a Yes message to xi−1 (operations on the indices are modulo k). This is impossible because an entity sends no more than one Yes message.

Answer to Exercise 2.9.31 First show that if a node x sends M to neighbor y, M contains the smallest value in T[x − y]; then, since a saturated node receives by definition an M message from all neighbors, it knows the minimum value in the network. Prove that the value sent by x to y in M is the minimum value in T[x − y] by induction on the height h of T[x − y].
Trivially true if h = 1, that is, x is a leaf. Let it be true up to k ≥ 1; we will now show it is true for h = k + 1. x sends M to y because it has received a value from all its other neighbors y1 , y2 , . . .; since the height of (T [yi − x]) is less than h, then by inductive hypothesis the value sent by yi to x is the minimum value in (T [yi − x]). This means that the smallest among v(x) and all the values received by x is the minimum value in T [x − y]; this is exactly what x sends to y. Answer to Exercise 2.9.41 It is clear that if node x knows |T [y − x]| for all neighbors y, then it can compute G[y, x] and decide whether x is itself a median and, if not, determine the direction of the median. Thus, to ﬁnd a median is sufﬁcient to modify the basic technique to supply this information to the elected node from which the median is approached. This is done by providing two counters, m1 and m2 , with each M message: When a node x sends a M message to y, then m1 = g[T [y − x], y] − 1 and m2 = |T [y − x]| − 1. An active node x processes all received M messages so that, before it sends M to the


last neighbor y, it knows G[T [x − z], x] and |T [z − x]| for all other neighbors z. In particular, the elected node can determine whether it is the median and, if not, can send a message toward it; a node receiving such a message will, in turn, perform the same operations until a median is located. Once again, the total number of exchanged messages is the ones of the Full Saturation plug-in plus d(s,med), where s is the saturated node closer to the medians, and med is the median furthermost from x. Partial Answer to Exercise 2.9.48 By induction on the height of the rooted tree, prove that, in a TDCast protocol, when an entity x receives the convergecast message from all its children, all its descendants have locally terminated the broadcast. Partial Answer to Exercise 2.9.49 Perform ﬁrst a broadcast from the root to notify all entities of the start of the protocol, and then a convergecast to collect at each entity the number of its descendents. Afterwards use this information to assign distinct values to the entities according to a preorder traversal of the tree. Partial Answer to Exercise 2.9.51 Show that the data items from smaller sets will be chosen with higher probability than that of the items from larger sets. BIBLIOGRAPHY [1] D. Angluin. Local and global properties in networks of processors. In Proc. of the 12th ACM STOC Symposium on Theory of Computing, pages 82–93, 1980. [2] B. Awerbuch. A new distributed depth-ﬁrst search algorithm. Information Processing Letters, 20:147–150, 1985. [3] E.J.H. Chang. Echo algorithms: Depth parallel operations on general graphs. IEEE Transactions on Software Engineering, SE-8(4):391–401, July 1982. [4] I. Cidon. Yet another distributed depth-ﬁrst search algorithm. Information Processing Letters, 26:301–305, 1987. [5] S. Dobrev, R. Kralovic, and N. Santoro. On the difﬁculty of waking up. In print, 2006. [6] S. Dobrev and P. Ruzicka. Linear broadcasting and O(n log log n) election in unoriented hypercubes. In Proc. 
of the 4th International Colloquium on Structural Information and Communication Complexity, (Sirocco’97), Ascona, July 1997. To appear. [7] E. Korach, D. Rotem, and N. Santoro. Distributed algorithms for ranking the nodes of a network. In 13th SE Conf. on Combinatorics, Graph Theory and Computing, volume 36 of Congressus Numeratium, pages 235–246, Boca Raton, February 1982. [8] E. Korach, D. Rotem, and N. Santoro. Distributed algorithms for ﬁnding centers and medians in networks. ACM Transactions on Programming Languages and Systems, 6(3):380–401, July 1984. [9] K.B. Lakshmanan, N. Meenakshi, and K. Thulasiraman. A time-optimal message-efﬁcient distributed algorithm for depth-ﬁrst search. Information Processing Letters, 25:103–109, 1987.


[10] N. Santoro. Determining topology information in distributed networks. In Proc. 11th SE Conf. on Combinatorics, Graph Theory and Computing, Congressus Numeratium, pages 869–878, Boca Raton, February 1980. [11] A. Segall. Distributed network protocols. IEEE Transactions on Information Theory, IT-29(1):23–35, Jan 1983. [12] D. Wall. Mechanisms for broadcast and selective broadcast. PhD thesis, Stanford University, June 1980. [13] Shmuel Zaks. Optimal distributed algorithms for sorting and ranking. IEEE Transactions on Computers, 34:376–380, 1985.

CHAPTER 3

Election

3.1 INTRODUCTION

In a distributed environment, many applications require a single entity to act temporarily as a central controller to coordinate the execution of a particular task by the entities. In some cases, the need for a single coordinator arises from the desire to simplify the design of the solution protocol for a rather complex problem; in other cases, the presence of a single coordinator is required by the nature of the problem itself.

The problem of choosing such a coordinator from a population of autonomous symmetric entities is known as Leader Election (Elect). Formally, the task consists in moving the system from an initial configuration where all entities are in the same state (usually called available) into a final configuration where all entities are in the same state (traditionally called follower), except one, which is in a different state (traditionally called leader). There is no restriction on the number of entities that can start the computation, nor on which entity should become leader.

We can think of the Election problem as the problem of enforcing restriction Unique Initiator in a system where actually no such restriction exists: The multiple initiators would first start the execution of an Election protocol; the sole leader will then be the unique initiator for the subsequent computation.

As election provides a mechanism for breaking the symmetry among the entities in a distributed environment, it is at the base of most control and coordination processes (e.g., mutual exclusion, synchronization, concurrency control) employed in distributed systems, and it is closely related to other basic computations (e.g., minimum finding, spanning-tree construction, traversal).

3.1.1 Impossibility Result

We will start by considering this problem under the standard restrictions R: Bidirectional Links, Connectivity, and Total Reliability. There is unfortunately a very strong impossibility result about election.
Theorem 3.1.1 Problem Elect is deterministically unsolvable under R.

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.


FIGURE 3.1: Electing a leader.

In other words, there is no deterministic protocol that will always correctly terminate within finite time if the only restrictions are those in R. To see why this is the case, consider a simple system composed of two entities, x and y, both initially available and with no different initial values; in other words, they are initially in identical states. If a solution protocol P exists, it must work under any conditions of message delays. Consider a synchronous schedule (i.e., an execution where communication delays are unitary) and let the two entities start the execution of P simultaneously. As they are in identical states, they will execute the same rule, obtain the same result, and compose and send (if any) the same message; thus, they will still be in identical states. If one of them receives a message, the other will receive the same message at the same time and, by Property 1.6.2, they will perform the same computation, and so on. Their state will always be the same; hence if one becomes leader, so will the other. But this is against the requirement that there should be only one leader; in other words, P is not a solution protocol.

3.1.2 Additional Restrictions

The consequence of Theorem 3.1.1 is that to break symmetry, we need additional restrictions and assumptions. Some restrictions are not powerful enough. This is the case, for example, with the assumption that there is already available a spanning tree (i.e., restriction Tree). In fact, the two-node network in which we know election is impossible is a tree.

To determine which restrictions, added to R, will enable us to solve Elect, we must consider the nature of the problem. The entities have an inherent behavioral symmetry: They all obey the same set of rules; in addition, they have an initial state symmetry (by definition of the election problem). To elect a leader means to break these symmetries; in fact, election is also called symmetry breaking.
To be able to do so, from the start there must be something in the system that the entities can use, something that makes (at least one of) them different. Remember that any restriction limits the applicability of the protocol. The most obvious restriction is Unique Initiator (UI): The unique initiator, known to be unique, becomes the leader. This is, however, “sweeping the problem under the carpet,” saying that we can elect a leader if there is already a leader and it knows about it. The problem is to elect a leader when many (possibly, all) entities are initiators; thus, without UI.
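The symmetry argument behind Theorem 3.1.1 can be replayed on a toy example. The sketch below is only an illustration, not part of the proof: it runs one deterministic rule at two entities that start in identical states and exchange messages with unit delays, and records that their states never diverge. The names `run_lockstep` and `greedy_rule`, and the encoding of a state as a `(status, step)` pair, are assumptions made here for illustration.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[str, int]      # (status, local step counter): hypothetical encoding
Message = Optional[str]
Rule = Callable[[State, Message], Tuple[State, Message]]


def run_lockstep(rule: Rule, rounds: int = 20) -> List[Tuple[State, State]]:
    """Run the same deterministic rule at entities x and y with unit delays."""
    sx = sy = ("AVAILABLE", 0)       # identical initial states, no distinct values
    inbox_x: Message = None
    inbox_y: Message = None
    history = []
    for _ in range(rounds):
        sx, out_x = rule(sx, inbox_x)
        sy, out_y = rule(sy, inbox_y)
        # Unit delays: what x sends now is what y receives in the next round.
        inbox_x, inbox_y = out_y, out_x
        history.append((sx, sy))
    return history


def greedy_rule(state: State, msg: Message) -> Tuple[State, Message]:
    """A hypothetical election attempt: claim leadership after three steps."""
    status, step = state
    if status != "AVAILABLE":
        return state, None
    if step == 3:
        return ("LEADER", step), "I am the leader"
    return ("AVAILABLE", step + 1), "hello"


history = run_lockstep(greedy_rule)
# The two entities are in the same state after every round,
# so both of them end up claiming leadership.
print(all(sx == sy for sx, sy in history), history[-1])
```

Any other deterministic `rule` can be substituted for `greedy_rule`; as long as it depends only on the local state and the received message, the two states stay equal in this schedule.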


The restriction that is commonly used is a very powerful one, Initial Distinct Values (ID), which we have already employed to circumvent a similar impossibility result for constructing a spanning tree with multiple initiators (see Section 2.5.5). Initial distinct values are sometimes called identifiers or ids or global names and, as we will see, their presence will be sufficient to elect a leader; let id(x) denote the distinct value of x. The use of this additional assumption is so frequent that the set of restrictions IR = R ∪ {ID} is called the standard set for election.

3.1.3 Solution Strategies

How can the difference in initial values be used to break the symmetry and to elect a leader? According to the election problem specifications, it does not matter which entity becomes the leader. Using the fact that the values are distinct, a possible strategy is to choose as a leader the entity with the smallest value; in other words, an election strategy is as follows:

Strategy Elect Minimum:
1. find the smallest value;
2. elect as a leader the entity with that value.

IMPORTANT. Finding the minimum value is an important problem of its own, which we have already discussed for tree networks (Section 2.6.2). Notice that on that occasion, we found the minimum value without unique identifiers; it is the election problem that needs them.

A useful variant of this strategy is the one restricting the choice of the leader to the set of entities that initiate the protocol. That is,

Strategy Elect Minimum Initiator:
1. find the smallest value among the initiators;
2. elect as a leader the entity with that value.

IMPORTANT. Notice that any solution implementing the strategy Elect Minimum solves Min as well as Elect; this is not so for the ones implementing Elect Minimum Initiator.

Similarly, we can define the Elect Maximum and the Elect Maximum Initiator strategies. Another strategy is to use the distinct values to construct a rooted spanning tree of the network and to elect the root as the leader.
In other words, an election strategy is as follows:


Strategy Elect Root:
1. construct a rooted spanning tree;
2. elect as the leader the root of the tree.

IMPORTANT. Constructing a (rooted) spanning tree is an important problem of its own, which we have already discussed among the basic problems (Section 2.5). Recall that SPT, like Elect, is unsolvable under R.

In the rest of this chapter, we will examine how to use these strategies to solve Elect under election’s standard set of restrictions IR = R ∪ {ID}. We will do so by first examining special types of networks and then focusing on the development of topology-independent solutions.

3.2 ELECTION IN TREES

The tree is the connected graph with the “sparsest” topology: m = n − 1. We have already seen how to optimally find the smallest value using the saturation technique: protocol MinF-Tree in Section 2.6.2. Hence the strategy Elect Minimum leads to an election protocol Tree:Elect Min where the number of messages in the worst case is as follows:

M[Tree:Elect Min] = 3n + k∗ − 4 ≤ 4n − 4.

Interestingly, the strategy Elect Minimum Initiator will also have the same complexity (Exercise 3.10.1).

Consider now applying the strategy Elect Root. As the network is a tree, the only work required is to transform it into a rooted tree. It is not difficult to see how saturation can be used to solve the problem. In fact, if Full Saturation is applied, then a saturated node knows that it itself and its parent are the only saturated nodes; furthermore, as a result of the saturation stage, every nonsaturated entity has identified as its parent the neighbor closest to the saturated pair. In other words, saturation will root the tree not in a single node but in a pair of neighbors: the saturated ones. Thus, to make the tree rooted in a single node we just need to choose only one of the two saturated nodes. In other words, the “Election” among all the nodes is reduced to an “election” between the two saturated ones. This can be easily accomplished by having the saturated nodes communicate their identities and by having the node with the smallest identity become elected, while the other stays processing. Thus, the Tree:Elect Root protocol will be Full Saturation with the new rules and the routine Resolve shown in Figure 3.2.

The number of message transmissions for the election algorithm Tree Election will be exactly the same as the one experienced by Full Saturation with notification


SATURATED
   Receiving(Election, id∗)
   begin
      if id(x) < id∗ then
         become LEADER;
      else
         become FOLLOWER;
      endif
      send("Termination") to N(x) − {parent};
   end

PROCESSING
   Receiving("Termination")
   begin
      become FOLLOWER;
      send("Termination") to N(x) − {parent};
   end

Procedure Resolve
begin
   send("Election", id(x)) to parent;
   become SATURATED;
end

FIGURE 3.2: New rules and routine Resolve used for Tree:Elect Root.

plus two “Election” messages, that is, M[Tree:Elect Root] = 3n + k∗ − 2 ≤ 4n − 2. In other words, it uses two messages more than the solution obtained using the strategy Elect Minimum.

Granularity of Analysis: Bit Complexity

From the discussion above, it would appear that the strategy Elect Minimum is “better” because it uses two fewer messages than the strategy Elect Root. This assessment is indeed the only correct conclusion obtainable using the number of messages as the cost measure. Sometimes, this measure is too “coarse” and does not really allow us to see possibly important details; to get a more accurate picture, we need to analyze the costs at a “finer” level of granularity. Let us re-examine the two strategies in terms of the number of bits. To do so, we have to distinguish between different types of messages because some contain counters and values, while others contain only a message identifier.

IMPORTANT. Messages that do not carry values but only a constant number of bits are called signals and, in most practical systems, they have significantly lower communication costs than value messages.

In Elect Minimum, only the n messages in the saturation stage carry a value, while all the others are signals; hence, the total number of bits transmitted will be

\[ \mathbf{B}[\text{Tree:Elect Min}] = n\,(c + \log \mathrm{id}) + c\,(2n + k^{*} - 2), \tag{3.1} \]

where id denotes the largest value sent in a message, and c = O(1) denotes the number of bits required to distinguish among the different messages. In Elect Root, only the “Election” message carries a node identity; thus, the total number of bits transmitted is

\[ \mathbf{B}[\text{Tree:Elect Root}] = 2\,(c + \log \mathrm{id}) + c\,(3n + k^{*} - 2). \tag{3.2} \]

That is, in terms of number of bits, Elect Root is an order of magnitude better than Elect Minimum. In terms of signals and value messages, with the Elect Root strategy we have only two value messages, and with the Elect Minimum strategy we have n value messages.

Remember: Measuring the number of bits always gives us a “picture” of the efficiency at a more refined level of granularity. Fortunately, it is not always necessary to go to such a level.
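To get a feeling for the two bit counts, the following sketch simply evaluates Equations (3.1) and (3.2) as stated. The function names, the default c = 8, and the choice of measuring log id as the bit length of the largest id are assumptions made here for illustration.

```python
def value_message_bits(max_id: int, c: int) -> int:
    """Bits of a value message: c bits of message type plus the bits of the value.

    Assumption: log id is taken as the bit length of the largest id.
    """
    return c + max_id.bit_length()


def bits_elect_min(n: int, k_star: int, max_id: int, c: int = 8) -> int:
    """Equation (3.1): n value messages, all other messages are signals."""
    return n * value_message_bits(max_id, c) + c * (2 * n + k_star - 2)


def bits_elect_root(n: int, k_star: int, max_id: int, c: int = 8) -> int:
    """Equation (3.2): only the two Election messages carry a value."""
    return 2 * value_message_bits(max_id, c) + c * (3 * n + k_star - 2)


n, k_star, max_id, c = 100, 100, 2**32 - 1, 8
b_min = bits_elect_min(n, k_star, max_id, c)
b_root = bits_elect_root(n, k_star, max_id, c)
print(b_min, b_root, b_min - b_root)
```

Subtracting the two formulas gives (n − 2) log id − 2c, so the gap grows with both the number of entities and the size of the ids; the larger the ids are with respect to c, the more Elect Root gains.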

3.3 ELECTION IN RINGS

We will now consider a network topology that plays a very important role in distributed computing: the ring, sometimes called loop network. A ring consists of a single cycle of length n. In a ring, each entity has exactly two neighbors (whose associated ports are) traditionally called left and right (see Figure 3.3).

IMPORTANT. Note that the labeling might, however, be globally inconsistent, that is, ‘right’ might not have the same meaning for all entities. We will return to this point later.

FIGURE 3.3: A ring network.

After trees, rings are the networks with the sparsest topology: m = n; however, unlike trees, rings have a complete structural symmetry (i.e., all nodes look the same). We will denote the ring by R = (x0, x1, ..., xn−1). Let us consider the problem of electing a leader in a ring R, under the standard set of restrictions for election, IR = {Bidirectional Links, Connectivity, Total Reliability, Initial Distinct Values}, as well as the knowledge that the network is a ring (Ring). Denote by id(x) the unique value associated to x.

Because of its structure, in a ring we will use almost exclusively the approach of minimum finding as a tool for leader election. In fact we will consider both the Elect Minimum and the Elect Minimum Initiator approaches. Clearly the first solves both Min and Elect, while the latter solves only Elect.

NOTE. Every protocol that elects a leader in a ring can be made to find the minimum value (if it has not already been determined) with n additional messages and time units (Exercise 3.10.2). Furthermore, in the worst case, the two approaches coincide: All entities might be initiators.

Let us now examine how minimum finding and election can be efficiently performed in a ring. As in a ring each entity has only two neighbors, for brevity we will use the notation other to indicate N(x) − {sender} at an entity x.

3.3.1 All the Way

The first solution we will use is rather straightforward: When an entity starts, it will choose one of its two neighbors and send to it an “Election” message containing its id; an entity receiving the id of somebody else will send its id (if it has not already done so) and forward the received message along the ring (i.e., send it to its other neighbor), keeping track of the smallest id seen so far (including its own). This process can be visualized as follows: Each entity originates a message (containing its id), and this message travels “all the way” along the ring (forwarded by the other entities) (see Figure 3.4).
Hence, the name All the Way will be used for the resulting protocol. Each entity will eventually see the id of everybody else (finite communication delays and total reliability ensure that), including the minimum value; it will, thus, be able to determine whether or not it is the (unique) minimum and, thus, the leader. When will this happen? In other words,

Question. When will an entity terminate its execution?

Entities only forward messages carrying values other than their own: Once the message with id(x) arrives at x, it is no longer forwarded. Thus, each value will travel “All the Way” along the ring only once. So, the communication activities will eventually terminate. But how does an entity know that the communication activities have terminated, that no more messages will be arriving, and, thus, the smallest value seen so far is really the minimum id?

FIGURE 3.4: All the Way: Every id travels along the ring.

Consider a “reasonable” but unfortunately incorrect answer: An entity knows that it has seen all values once it receives its value back. The “reason” is that the message with its own id has to travel longer along the ring to reach x than those originated by other entities; thus, these other messages will be received first. In other words, reception of its own message can be used to detect termination.

This reasoning is incorrect because it uses the (hidden) additional assumption that the system has first in first out (FIFO) communication channels, that is, the messages are delivered in the order in which they are sent. This restriction, called Message Ordering, is not a part of election’s standard set; few systems actually have it built in, and the costs of offering it can be formidable.

So, whatever the answer, it must not assume FIFO channels. With this proviso, a “reasonable” but unfortunately still incorrect answer is the following: An entity counts how many different values it receives; when the counter is equal to n, it knows it can terminate.


PROTOCOL All the Way.

States: S = {ASLEEP, AWAKE, FOLLOWER, LEADER}; S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
   Spontaneously
   begin
      INITIALIZE;
      become AWAKE;
   end

   Receiving("Election", value∗, counter∗)
   begin
      INITIALIZE;
      send("Election", value∗, counter∗ + 1) to other;
      min := Min{min, value∗};
      count := count + 1;
      become AWAKE;
   end

AWAKE
   Receiving("Election", value∗, counter∗)
   begin
      if value∗ ≠ id(x) then
         send("Election", value∗, counter∗ + 1) to other;
         min := Min{min, value∗};
         count := count + 1;
         if known then CHECK endif;
      else
         ringsize := counter∗;
         known := true;
         CHECK;
      endif
   end

FIGURE 3.5: Protocol All the Way.

The problem is that this answer assumes that the entity knows n, but a priori knowledge of the ring size is not a part of the standard restrictions for election. So it cannot be used. It is indeed strange that termination should be difficult for such a simple protocol in such a clear setting.

Fortunately, the last answer, although incorrect, provides us with the way out. In fact, although n is not known a priori, it can be computed. This is easily accomplished by having a counter in the Election message, initialized to 1 and incremented by each entity forwarding it; when an entity receives its id back, the value of the counter will be n.

Summarizing, we will use a counter at each entity, to keep track of how many different ids are received, and a counter in each message, so that each entity can determine n. The protocol is shown in Figures 3.5 and 3.6.

Procedure INITIALIZE
begin
   count := 0;
   size := 1;
   known := false;
   send("Election", id(x), size) to right;
   min := id(x);
end

Procedure CHECK
begin
   if count = ringsize then
      if min = id(x) then
         become LEADER;
      else
         become FOLLOWER;
      endif
   endif
end

FIGURE 3.6: Procedures of protocol All the Way.

The message originated by each entity will travel along the ring exactly once. Thus, there will be exactly n² messages in total, each carrying a counter and a value, for a total of n² log(id + n) bits. The time costs will be at most 2n (Exercise 3.10.3). Summarizing,

\[ \mathbf{M}[\text{AlltheWay}] = n^{2} \tag{3.3} \]

\[ \mathbf{T}[\text{AlltheWay}] \le 2n - 1. \tag{3.4} \]
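The behavior of All the Way can be explored with a small simulation. The following is a minimal sketch under stated assumptions, not the protocol's reference implementation: the ring is oriented (every message is sent clockwise), a single randomly chosen entity starts spontaneously and the others are woken up by the first message they receive, and messages in transit are delivered in random order, so no FIFO behavior is assumed. One further assumption is made explicit in the code: an entity counts its own id among the ids seen (`count` starts at 1), so that the count can reach the ring size n computed from the message counter. The function name `all_the_way` and its data layout are hypothetical.

```python
import random


def all_the_way(ids, seed=0):
    """Simulate All the Way on an oriented ring; return (final states, messages sent)."""
    rng = random.Random(seed)
    n = len(ids)
    state = ["ASLEEP"] * n
    smallest = list(ids)        # smallest id seen so far by each entity
    count = [1] * n             # ids seen so far, own id included (assumption)
    ring_size = [None] * n      # learned when the entity's own message returns
    in_transit = []             # entries: (destination, value, counter)
    sent = 0

    def send(src, value, counter):
        nonlocal sent
        in_transit.append(((src + 1) % n, value, counter))
        sent += 1

    def wake(x):
        state[x] = "AWAKE"
        send(x, ids[x], 1)      # counter initialized to 1

    def check(x):
        if ring_size[x] is not None and count[x] == ring_size[x]:
            state[x] = "LEADER" if smallest[x] == ids[x] else "FOLLOWER"

    wake(rng.randrange(n))      # one spontaneous initiator
    while in_transit:
        # Random delivery order: messages may overtake each other.
        dest, value, counter = in_transit.pop(rng.randrange(len(in_transit)))
        if state[dest] == "ASLEEP":
            wake(dest)
        if value != ids[dest]:
            send(dest, value, counter + 1)          # forward somebody else's id
            smallest[dest] = min(smallest[dest], value)
            count[dest] += 1
        else:
            ring_size[dest] = counter               # own id is back: counter = n
        check(dest)
    return state, sent


ids = [22, 4, 5, 13, 2, 17]
state, sent = all_the_way(ids, seed=1)
print(state, sent)
```

Because each of the n ids is transmitted once on each of the n links, the number of messages in this sketch does not depend on the delivery order; only the moment at which each entity terminates does.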

The solution protocol we have just designed is very expensive in terms of communication costs (in a network with 100 nodes it would cause 10,000 message transmissions).

The protocol can obviously be modified so as to follow strategy Elect Minimum Initiator, finding the smallest value only among the initiators. In this case, those entities that do not initiate will not originate a message but just forward the others’. In this way, we would have fewer messages whenever there are fewer initiators.

In the modification we must be careful. In fact, in protocol All the Way, we were using an entity’s own message to determine n so as to be able to determine local termination. Now some entities will not have this information. This means that termination is again a problem. Fortunately, this problem has a simple solution requiring only n additional messages and time (Exercise 3.10.4). Summarizing, the costs of the modified protocol, All the Way:Minit, are as follows:

\[ \mathbf{M}[\text{AlltheWay:Minit}] = n k^{*} + n \tag{3.5} \]

\[ \mathbf{T}[\text{AlltheWay:Minit}] \le 3n - 1 \tag{3.6} \]

The modified protocol All the Way:Minit will in general use fewer messages than the original one. In fact, if only a constant number of entities initiate, it will use only

O(n) messages, which is excellent. By contrast, if every entity is an initiator, this protocol uses n messages more than the original one.

IMPORTANT. Notice that All the Way (in its original or modified version) can also be used in unidirectional rings with the same costs. In other words, it does not require the Bidirectional Links restriction. We will return to this point later.

3.3.2 As Far As It Can

To design an improved protocol, let us determine the drawback of the one we already have: All the Way. In this protocol, each message travels all along the ring.

Consider the situation (shown in Figure 3.7) of a message containing a large id, say 22, arriving at an entity x with a smaller id, say 4. In the existing protocol, x will forward this message, even though x knows that 22 is not the smallest value. But our overall strategy is to determine the smallest id among all entities; if an entity determines that an id is not the minimum, there is no need whatsoever for the message containing such an id to continue traveling along the ring.

We will thus modify the original protocol All the Way so that an entity will only forward Election messages carrying an id smaller than the smallest seen so far by that entity. In other words, an entity will become an insurmountable obstacle for all messages with a larger id, “terminating” them.

FIGURE 3.7: Message with a larger id does not need to be forwarded.

Let us examine what happens with this simple modification. Each entity will originate a message (containing its id) that travels along the ring “as far as it can”: until it returns to its originator or arrives at a node with a smaller id. Hence the name AsFar (As It Can) will be used for the resulting protocol.

Question. When will an entity terminate its execution?

The message with the smallest id will always be forwarded by the other entities; thus, it will travel all along the ring returning to its originator. The message containing another id will instead be unable to return to its originator because it will find an entity with a smaller id (and thus be terminated) along the way. In other words, only the message with the smallest id will return to its originator. This fact provides us with a termination detection mechanism.

If an entity receives a message with its own id, it knows that its id is the minimum, that is, it is the leader; the other entities have all seen that message pass by (they forwarded it) but they still do not know that there will be no smaller ids to come by. Thus, to ensure their termination, the newly elected leader must notify them by sending an additional message along the ring.

Message Cost

This protocol will definitely have fewer messages than the previous one. The exact number depends on several factors. Consider the cost caused by the Election message originated by x. This message will travel along the ring until it finds a smaller id (or completes the tour). Thus, the cost of its travel depends on how the ids are allocated on the ring. Also notice that what matters is whether an id is smaller or not than another and not their actual value. In other words, what is important is the rank of the ids and how those are situated on the ring. Denote by #i the id whose rank is i.
Worst Case

Let us first consider the worst possible case. Id #1 will always travel all along the ring costing n messages. Id #2 will be stopped only by id #1; so its cost in the worst case is n − 1, achievable if id #2 is located immediately after id #1 in the direction it travels. In general, id #(i + 1) will be stopped by any of those with smaller rank, and, thus, it will cost at most n − i messages; this will happen if all those entities are next to each other, and id #(i + 1) is located immediately after them in the direction it will travel. In fact, all the worst cases for each of the ids are simultaneously achieved when the ids are arranged in a (circular) order according to their rank and all messages are sent in the “increasing” direction (see Figure 3.9). In this case, including also the n messages required for the final notification, the total cost will be

\[ \mathbf{M}[\text{AsFar}] = n + \sum_{i=1}^{n} i = \frac{n\,(n+3)}{2}. \tag{3.7} \]
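The worst-case count can be checked on small rings with a simulation of AsFar. This is a sketch under stated assumptions rather than a definitive implementation: the ring is oriented (everything is sent clockwise), all entities start spontaneously, and messages are delivered in the order in which they were sent, so no message overtakes another. The function name `as_far` and the tuple layout of messages are hypothetical.

```python
from collections import deque


def as_far(ids):
    """Simulate AsFar on an oriented ring; return (final states, messages sent)."""
    n = len(ids)
    smallest = list(ids)            # smallest id seen so far by each entity
    state = ["AWAKE"] * n           # all entities start spontaneously
    # Every entity sends its own id to its clockwise neighbor.
    queue = deque(((x + 1) % n, "Election", ids[x]) for x in range(n))
    sent = n
    while queue:
        dest, kind, value = queue.popleft()     # delivery in sending order
        right = (dest + 1) % n
        if kind == "Election":
            if value < smallest[dest]:
                smallest[dest] = value          # forward only smaller ids
                queue.append((right, "Election", value))
                sent += 1
            elif value == smallest[dest]:
                state[dest] = "LEADER"          # own id came back: it is the minimum
                queue.append((right, "Notify", None))
                sent += 1
            # otherwise the message carries a larger id and is dropped here
        elif state[dest] == "AWAKE":
            state[dest] = "FOLLOWER"            # pass the notification along
            queue.append((right, "Notify", None))
            sent += 1
        # a Notify reaching the leader ends the notification tour
    return state, sent


n = 14
worst = list(range(1, n + 1))       # ranks increasing in the sending direction
state, sent = as_far(worst)
print(sent, n * (n + 3) // 2)
```

Reversing the arrangement (ranks decreasing in the sending direction) makes every message except the one of the minimum stop after a single transmission, which illustrates how strongly the cost depends on where the ids are located.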

PROTOCOL AsFar.

States: S = {ASLEEP, AWAKE, FOLLOWER, LEADER}; S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
   Spontaneously
   begin
      INITIALIZE;
      become AWAKE;
   end

   Receiving("Election", value)
   begin
      INITIALIZE;
      if value < min then
         send("Election", value) to other;
         min := value;
      endif
      become AWAKE;
   end

AWAKE
   Receiving("Election", value)
   begin
      if value < min then
         send("Election", value) to other;
         min := value;
      else
         if value = min then NOTIFY endif;
      endif
   end

   Receiving(Notify)
   begin
      send(Notify) to other;
      become FOLLOWER;
   end

where the procedures Initialize and Notify are as follows:

Procedure INITIALIZE
begin
   send("Election", id(x)) to right;
   min := id(x);
end

Procedure NOTIFY
begin
   send(Notify) to right;
   become LEADER;
end

FIGURE 3.8: Protocol AsFar.

FIGURE 3.9: Worst case setting for protocol AsFar.

That is, we will cut the number of messages at least to half. From a theoretical point of view, the improvement is not signiﬁcant; from a practical point of view, this is already a reasonable achievement. However we have so far analyzed only the worst case. In general, the improvement will be much more signiﬁcant. To see precisely how, we need to perform a more detailed analysis of the protocol’s performance. IMPORTANT. Notice that AsFar can be used in unidirectional rings. In other words, it does not require the Bidirectional Links restriction. We will return to this point later. The worst case gives us an indication of how “bad” things could get when the conditions are really bad. But how likely are such conditions to occur? What costs can we generally expect? To ﬁnd out, we need to study the average case and determine the mean and the variance of the cost of the protocol. Average Case: Oriented Ring We will ﬁrst consider the case when the ring is oriented, that is, “right” means the same to all entities. In this case, all messages will travel in only one direction, say clockwise. IMPORTANT. Because of the unique nature of the ring network, this case coincides with the execution of the protocol in a unidirectional ring. Thus, the results we will obtain will hold for those rings.


To determine the average case behavior, we consider all possible arrangements of the ranks 1, ..., n in the ring as equally likely. Given a set of size a, we denote by C(a, b) the number of subsets of size b that can be formed from it.

Consider the id #i with rank i; it will travel clockwise exactly k steps if and only if the ids of its k − 1 clockwise neighbors are larger than it (and thus will forward it), while the id of its kth clockwise neighbor is smaller (and thus will terminate it). There are n − i ids larger than id #i from which to choose those k − 1 larger clockwise neighbors, and there are i − 1 ids smaller than id #i from which to choose the kth clockwise neighbor. In other words, the number of situations where id #i will travel clockwise exactly k steps is C(n − i, k − 1) C(i − 1, 1), out of the total number of C(n − 1, k − 1) C(n − k, 1) possible situations. Thus, the probability P(i, k) that id #i will travel clockwise exactly k steps is

\[ P(i,k) = \frac{C(n-i,\,k-1)\; C(i-1,\,1)}{C(n-1,\,k-1)\; C(n-k,\,1)}. \tag{3.8} \]

The smallest id, #1, will travel the full length n of the ring. The id #i, i > 1, will travel less; the expected distance will be

\[ E_i = \sum_{k=1}^{n-1} k\, P(i,k). \tag{3.9} \]

Therefore, the overall expected number of message transmissions is

\[ E = n + \sum_{i=2}^{n} \sum_{k=1}^{n-1} k\, P(i,k) = n + \sum_{k=1}^{n-1} \frac{n}{k+1} = n H_n, \tag{3.10} \]

where H_n = 1 + 1/2 + 1/3 + ... + 1/n is the nth Harmonic number. To obtain a closed formula, we use the fact that the function f(x) = 1/x is continuous and decreasing, so that H_n differs from the integral ∫_1^n (1/x) dx = ln n − ln 1 = ln n by at most a constant c. Hence, H_n = ln n + O(1) ≈ .69 log n + O(1); thus

Theorem 3.3.1 In oriented and in unidirectional rings, protocol AsFar will cost nH_n ≈ .69 n log n + O(n) messages on an average.

This is indeed great news: On an average, the message cost is an order of magnitude less than that in the worst case. For n = 1024, this means that on an average we have about 7066 messages (taking the leading term .69 n log n) instead of 525,824, which is a considerable difference.

If we use the strategy of electing the Minimum Initiator instead, we obtain the same bound but as a function of the number k∗ of initiators:

Theorem 3.3.2 In oriented and in unidirectional rings, protocol AsFar-Minit will cost $n H_{k_*} \approx .69\, n \log k_*$ messages on an average.

Average Case: Unoriented Ring Let us now consider what will happen on an average in the general case, when the ring is unoriented. As before, we consider all possible arrangements of the ranks 1, . . . , n of the values in the ring as equally likely. The fact that the ring is not oriented means that when two entities send a message to their “right” neighbors, they might send it in different directions. Let us assume that at each entity the probability that “right” coincides with the clockwise direction is $\frac{1}{2}$. Alternatively, assume that an entity, as its first step in the protocol, flips a fair coin (i.e., probability $\frac{1}{2}$) to decide the direction it will use to send its value. We shall call the resulting probabilistic protocol ProbAsFar.

Theorem 3.3.3 In unoriented rings, protocol ProbAsFar will cost $\frac{\sqrt{2}}{2}\, n H_n \approx .49\, n \log n$ messages on an average.

A similar bound holds if we use the strategy of electing the Minimum Initiator:

Theorem 3.3.4 In unoriented rings, protocol ProbAsFar-Minit will cost $\frac{\sqrt{2}}{2}\, n H_{k_*} \approx .49\, n \log k_*$ messages on an average.
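The average-case claim of Theorem 3.3.1 can be checked by brute force on small rings. The sketch below is only an illustration under stated assumptions: the function names (asfar_messages, harmonic, average_messages) are hypothetical, the ring is unidirectional, every entity is an initiator, and the final notification is not counted. It enumerates all arrangements of the ranks and compares the exact average with $n H_n$.

```python
from fractions import Fraction
from itertools import permutations


def asfar_messages(ring):
    """Count the transmissions of id-carrying messages in AsFar.

    ring[j] is the id at position j; every entity sends its id to position
    j + 1 (mod n). An id is forwarded until it reaches a smaller id or,
    for the minimum, until it comes back to its originator.
    """
    n = len(ring)
    total = 0
    for start in range(n):
        hops = 0
        pos = start
        while True:
            pos = (pos + 1) % n
            hops += 1
            # The message stops at a smaller id, or back at its originator.
            if ring[pos] < ring[start] or pos == start:
                break
        total += hops
    return total


def harmonic(n):
    """Exact value of the nth Harmonic number."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def average_messages(n):
    """Exact average over all n! equally likely arrangements of the ranks."""
    counts = [asfar_messages(p) for p in permutations(range(1, n + 1))]
    return Fraction(sum(counts), len(counts))
```

For example, average_messages(3) and 3 * harmonic(3) are both 11/2. The enumeration grows as n!, so it is meant only for very small n.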

What is very interesting about the bound expressed by Theorem 3.3.3 is that it is better (i.e., smaller) than the one expressed by Theorem 3.3.1. The difference between the two bounds is restricted to the constant and is rather limited. In numerical terms, the difference is not outstanding: 5018 instead of 7066 messages on an average when n = 1024. In practical terms, from the algorithm design point of view, it indicates that we should try to have the entities send their initial message in different directions (as in the probabilistic protocol) and not all in the same one (as in the oriented case). To simulate the initial “random” direction, different means can be used. For example, each entity x can choose (its own) “right” if id(x) is even, and (its own) “left” otherwise. This result also has a theoretical relevance that will become apparent later, when we discuss lower bounds and take a closer look at the nature of the difference between oriented and unoriented rings.

Time Costs The time costs are the same as those of All the Way plus an additional n − 1 for the notification. This can, however, be halved by exploiting the fact that the links are bidirectional and by broadcasting the notification; this will require an extra message but halve the time.

Summary The main drawback of protocol AsFar is that there still exists the possibility that a very large number of messages ($O(n^2)$) will be exchanged. As we have seen, on an average, the use of the protocol will cost only O(n log n) messages. There
is, however, no guarantee that this will happen the next time the protocol is used. To give such a guarantee, a protocol must have an O(n log n) worst case complexity.

3.3.3 Controlled Distance

We will now design a protocol that has a guaranteed O(n log n) message performance. To achieve this goal, we must first of all determine what causes the previous protocol to use $O(n^2)$ messages and then find ways around it.

The first thing to observe is that in AsFar (as well as in All the Way), an entity makes only one attempt to become leader and does so by originating a message containing its id. Next observe that, once this message has been created and sent, the entity no longer has any control over it: In All the Way the message will travel all along the ring; in AsFar it will be stopped if it finds a smaller id.

Consider now the situation that causes the worst case for protocol AsFar: this is when the ids are arranged in an increasing order along the ring, and all entities identify “right” with the clockwise direction (see Figure 3.9). The entity x with id 2 will originate a message that will cause n − 2 transmissions. When x receives the message containing id 1, x finds out that its own value is not the smallest, and thus its message is destined to be wasted. However, x has no means to stop it, as it no longer has any control over that message.

Let us take these observations into account to design a more efficient protocol. The key design goal will be to make an entity retain some control over the message it originates. We will use several ideas to achieve this:

1. limited distance: The entity will impose a limit on the distance its message will travel; in this way, the message with id 2 will not travel “as far as it can” (i.e., at distance n − 2) but only up to some predefined length.

2. return (or feedback) messages: If, during this limited travel, the message is not terminated by an entity with smaller id, it will return back to its originator to get authorization for further travel; in this way, if the entity with id 2 has seen id 1, it will abort any further travel of its own message.

Summarizing, an entity x will originate a message with its own id, and this message will travel until it is terminated or it reaches a certain distance dis; if it is not terminated, the message returns to the entity. When it arrives, x knows that on this side of the ring, there are no smaller ids within the traveled distance dis. The entity must now decide whether to allow its message to travel a further distance; it will do so only if it knows for sure that there are no smaller ids within distance dis on the other side of the ring as well. This can be achieved as follows:

3. check both sides: The entity will send a message in both directions; only if they both return will they be allowed to travel a further distance.

As a consequence, instead of a single global attempt at leadership, an entity will go through several attempts, which we shall call Electoral Stages: An entity enters the
next stage only if it passes the current one (i.e., both messages return) (see Fig. 3.10). If an entity is defeated in an electoral stage (i.e., at least one of its messages does not return), it will still have to continue its participation in the algorithm, forwarding the messages of those entities that are still undefeated.

FIGURE 3.10: Controlled distances: A message travels no more than dis(i); if it is not discarded, a feedback is sent back to the originator. A candidate that receives a feedback from both sides starts the next stage.

Although the protocol is almost entirely outlined, some fundamental issues are still unresolved. In particular, the fact that we now have several stages can have strange consequences in the execution.

IMPORTANT. Because of variations in communication delays, it is possible that at the same time instant, entities in different parts of the ring are in different electoral stages. Furthermore, as we are only using the standard restrictions for elections, messages can be delivered out of order; thus, it might be possible that messages from a higher stage will arrive at an entity before the ones from the current one.

We said that an entity is defeated if it does not receive one of its messages back. Consider now an entity x; it has sent its two messages and it is now waiting to know the outcome. Let us say that one of its messages has returned but the other has not yet. It is possible that the message is coming very slowly (e.g., experiencing long transmission delays) or that it is not coming at all (i.e., it found a smaller id on the way). How can x know? How long will x have to wait before taking a decision (a decision must be taken within finite time)? More specifically, what will x do if, in the meanwhile, it receives a message from a higher stage? The answer to all these
questions is fortunately simple:

4. the smallest id wins: If, at any time, a candidate entity receives a message with a smaller id, it will become defeated, regardless of the stage number.

Notice that this creates a new situation: A message returns to its originator and finds it defeated; in this case, the message will be terminated.

The final issue we need to address is termination. The limit to the travel distance for a message in a given stage will depend on the stage itself; let $dis_i$ denote the limit in stage i. Clearly, these distances must be monotonically increasing, that is, $dis_i > dis_{i-1}$. The messages from an entity whose id is not the minimum will sooner or later encounter a smaller id in their travel and will not return to their originator. Consider now the entity s with the smallest id. In each stage, both of its messages will travel the full allocated distance (as no entity can terminate them) and return, making s enter the next stage. This process will continue until $dis_i \ge n$; at this time, each message will complete a full tour of the ring, reaching s from the other side. When this happens, s will know that it has the smallest value and, thus, that it is the leader. It will then start a notification process so that all the other entities can enter a terminal state.
A synthetic description of the protocol will thus be as follows:

• in each electoral stage there are some candidates;
• each candidate sends a message in both directions carrying its own id (as well as the stage number);
• a message travels until it encounters a smaller id or it reaches a certain distance (whose value depends on the stage);
• if a message does not encounter a smaller id, it will return back to its originator;
• a candidate that receives both of its own messages back survives this stage and starts the next one;

with three meta rules:

• if a candidate receives its message from the opposite side it sent to, it becomes the leader and notifies all the other entities of termination;
• if a candidate receives a message with a smaller id, it becomes defeated, regardless of the stage number;
• a defeated entity forwards the messages originating from the other entities; if the message is notification of termination, it will terminate.

The fully specified protocol Control is shown in Figures 3.11 and 3.12, where dis is a monotonically increasing function.

PROTOCOL Control.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Forth", id*, stage*, limit*)
    begin
        if id* < id(x) then
            PROCESS-MESSAGE;
            become DEFEATED
        else
            INITIALIZE;
            become CANDIDATE;
        endif
    end

CANDIDATE
    Receiving("Forth", id*, stage*, limit*)
    begin
        if id* < id(x) then
            PROCESS-MESSAGE;
            become DEFEATED
        else
            if id* = id(x) then NOTIFY endif;
        endif
    end

    Receiving("Back", id*)
    begin
        if id* = id(x) then CHECK endif;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

DEFEATED
    Receiving(*)
    begin
        send(*) to other;
        if * = Notify then become FOLLOWER endif;
    end

FIGURE 3.11: Protocol Control.

Procedure INITIALIZE
begin
    stage := 1;
    limit := dis(stage);
    count := 0;
    send("Forth", id(x), stage, limit) to N(x);
end

Procedure PROCESS-MESSAGE
begin
    limit* := limit* - 1;
    if limit* = 0 then
        send("Back", id*, stage*) to sender;
    else
        send("Forth", id*, stage*, limit*) to other;
    endif
end

Procedure CHECK
begin
    count := count + 1;
    if count = 2 then
        count := 0;
        stage := stage + 1;
        limit := dis(stage);
        send("Forth", id(x), stage, limit) to N(x);
    endif
end

Procedure NOTIFY
begin
    send(Notify) to right;
    become LEADER;
end

FIGURE 3.12: Procedures used by protocol Control.

Correctness The correctness of the algorithm follows from the dynamics of the rules: The messages containing the smallest id will always travel all the allocated distance, and every entity still candidate that they encounter will become defeated; the distance is monotonically increasing in the number of stages; hence, eventually, the distance will be at least n. When this happens, the messages with the smallest value will travel all along the ring; as a result, their originator becomes leader and all the others are already defeated.

Costs The costs of the algorithm depend totally on the choice of the function dis used to determine the maximum distance a “Forth” message can travel in a stage.

Messages If we examine the execution of the protocol at some global time t, because communication delays are unpredictable, we can find not only that entities in different parts of the ring are in different states (which is expected) but also that entities in the candidate state are in different stages. Moreover, because there is no Message Ordering, messages from high stages (the “future”) might overtake messages from lower stages and arrive at an entity still in a lower stage (the “past”). Still, we can visualize the execution as proceeding in logical stages; it is just that different entities might be executing the same stage at different times.


Focus on stage i > 1 and consider the entities that will start this stage; these $n_i$ entities are those that survived stage i − 1. To survive stage i − 1, the id of an entity x must be smaller than the ids of its neighbors at distance up to dis(i − 1) on each side of the ring. Thus, within any group of dis(i − 1) + 1 consecutive entities, at most one can survive stage i − 1 and start stage i. In other words,

$$n_i \le \frac{n}{\mathrm{dis}(i-1) + 1}. \qquad (3.11)$$

An entity starting stage i will send “Forth” messages in both directions; each message will travel at most dis(i), for a total of $2 n_i\, \mathrm{dis}(i)$ message transmissions.

Let us examine now the “Back” messages. Each entity that survives this stage will receive such a message from both sides; as $n_{i+1}$ entities survive this stage, this gives an additional $2 n_{i+1}\, \mathrm{dis}(i)$ messages. Each entity that started but did not survive stage i will receive at most one “Back” message, causing a cost of at most dis(i); as there are $n_i - n_{i+1}$ such entities, they will cost no more than an additional $(n_i - n_{i+1})\, \mathrm{dis}(i)$ messages in total. So, in total, the transmissions for “Back” messages are at most $2 n_{i+1}\, \mathrm{dis}(i) + (n_i - n_{i+1})\, \mathrm{dis}(i)$.

Summarizing, the total number of messages sent in stage i > 1 will be no more than

$$2 n_i\, \mathrm{dis}(i) + 2 n_{i+1}\, \mathrm{dis}(i) + (n_i - n_{i+1})\, \mathrm{dis}(i) = (3 n_i + n_{i+1})\, \mathrm{dis}(i) \le \left( 3\, \frac{n}{\mathrm{dis}(i-1)+1} + \frac{n}{\mathrm{dis}(i)+1} \right) \mathrm{dis}(i) < n \left( 3\, \frac{\mathrm{dis}(i)}{\mathrm{dis}(i-1)} + 1 \right).$$

The first stage is a bit different, as every entity starts; the $n_2$ entities that survive this stage will have caused the messages carrying their id to travel to distance dis(1) and back on both sides, for a total of $4 n_2\, \mathrm{dis}(1)$ messages. The $n - n_2$ entities that will not survive will cause at most three messages each (two “Forth” and one “Back”) to travel distance dis(1), for a total of $3 (n_1 - n_2)\, \mathrm{dis}(1)$ messages. Hence the first stage will cost no more than

$$(3n + n_2)\, \mathrm{dis}(1) \le \left( 3n + \frac{n}{\mathrm{dis}(1)+1} \right) \mathrm{dis}(1) < n\, (3\, \mathrm{dis}(1) + 1).$$

To determine the total number of messages, we then need to know the total number k of stages. We know that a leader is elected as soon as the message with the smallest value makes a complete tour of the ring, that is, as soon as dis(i) is greater than or equal to n. In other words, k is the smallest integer such that dis(k) ≥ n; such an integer is called the pseudo-inverse of n and denoted by $\mathrm{dis}^{-1}(n)$. So, the total number of messages used by protocol Control will be at most

$$M[\mathrm{Control}] \le n \sum_{i=1}^{\mathrm{dis}^{-1}(n)} \left( 3\, \frac{\mathrm{dis}(i)}{\mathrm{dis}(i-1)} + 1 \right) + n, \qquad (3.12)$$

where dis(0) = 1 and the last n messages are those for the final notification.
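To get a feeling for Expression 3.12, its right-hand side can be evaluated directly for a concrete distance function. The following is a minimal sketch, not part of the protocol: the helper names dis_inverse, control_message_bound, and doubling are hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def dis_inverse(n, dis):
    """Smallest integer k such that dis(k) >= n (the pseudo-inverse of n)."""
    k = 1
    while dis(k) < n:
        k += 1
    return k


def control_message_bound(n, dis):
    """Right-hand side of Expression 3.12, taking dis(0) = 1."""
    stages = dis_inverse(n, dis)
    total = Fraction(0)
    prev = 1  # dis(0) = 1 by convention
    for i in range(1, stages + 1):
        total += Fraction(3 * dis(i), prev) + 1
        prev = dis(i)
    return n * total + n  # the last n messages are the notification


def doubling(i):
    """Example choice of distance function: dis(i) = 2^(i-1)."""
    return 2 ** (i - 1)
```

With n = 1024 and the doubling function there are 11 stages, and the bound evaluates to 76,800 messages.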


To really finalize the design, we must choose the function dis. Different choices will result in different performances. Consider, for example, the choice $\mathrm{dis}(i) = 2^{i-1}$; then $\frac{\mathrm{dis}(i)}{\mathrm{dis}(i-1)} = 2$ (i.e., we double the distance every time) and $\mathrm{dis}^{-1}(n) = \log n + 1$, which in Expression 3.12 yields

$$M[\mathrm{Control}] \le 7\, n \log n + O(n),$$

which is what we were aiming for: an O(n log n) worst case.

The constant can, however, be further improved by carefully selecting dis. It is rather difficult to determine the best function. Let us restrict the choice to the functions where, like the one above, the ratio between consecutive values is constant, that is, $\frac{\mathrm{dis}(i)}{\mathrm{dis}(i-1)} = c$. For these functions, $\mathrm{dis}^{-1}(n) = \log_c n + 1$; thus, Expression 3.12 becomes

$$\frac{3c+1}{\log c}\, n \log n + O(n).$$

Thus, with all of them, protocol Control has a guaranteed O(n log n) performance. The “best” among those functions will be the one where $\frac{3c+1}{\log c}$ is minimized; as distances must be integer quantities, c must also be an integer. Thus the best choice is c = 3, for which we obtain

$$M[\mathrm{Control}] \le 6.309\, n \log n + O(n). \qquad (3.13)$$
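The choice c = 3 can be confirmed numerically by tabulating the leading constant for small integer ratios. This is only an illustrative check; the function name leading_constant and the search range 2..10 are assumptions made for the example.

```python
import math


def leading_constant(c):
    """Constant (3c + 1) / log c of the n log n term, with log in base 2."""
    return (3 * c + 1) / math.log2(c)


# Integer ratios examined here; the constant keeps growing for larger c.
candidates = range(2, 11)
best_c = min(candidates, key=leading_constant)

for c in candidates:
    print(c, round(leading_constant(c), 3))
```

The table shows 7.0 for c = 2, about 6.309 for c = 3, and 6.5 for c = 4, after which the constant increases.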

Time The ideal time complexity of protocol Control is easy to determine; the time required by stage i is the time needed by the message containing the smallest id to reach its assigned distance and come back to its originator; hence exactly 2 dis(i) time units. An additional n time units are needed for the final notification, as well as for the initial wake-up of the entity with the smallest id. This means that the total time costs will be at most

$$T[\mathrm{Control}] \le 2n + \sum_{i=1}^{\mathrm{dis}^{-1}(n)} 2\, \mathrm{dis}(i). \qquad (3.14)$$

Again, the choice of dis will influence the complexity. Using any function of the form $\mathrm{dis}(i) = c^{i-1}$, where c is an integer greater than 1, will yield O(n) time. The determination of the best choice from the time costs point of view is left as an exercise.

Electing Minimum Initiator Let us use the strategy of electing a leader only among the initiators. Denote as usual by k the number of initiators. Let us analyze the worst case. In the analysis of protocol Control, we have seen that those that survive stage i contribute 4 dis(i) messages each to the cost, while those that do not survive contribute at most 3 dis(i) messages each. This is still true in the modified version Control-Minit;
what changes is the value of the number $n_i$ of entities that will start that stage. Initially, $n_1 = k$. In the worst case, the k initiators are placed far enough from each other in the ring that each completes the stage without interfering with the others; if the distances between them are large enough, each can continue to go to higher stages without coming into contact with the others, thus causing 4 dis(i) messages. For how many stages can this occur? This can occur as long as $\mathrm{dis}(i) < \frac{n}{k} + 1$. That is, in the worst case, $n_i = k$ in each of the first $l = \mathrm{dis}^{-1}\!\left(\frac{n}{k} + 1\right) - 1$ stages, and the cost of each of these stages will be 4 k dis(i) messages. In the following stages instead, the initiators will start interfering with each other, and the number of survivors will follow the pattern of the general algorithm: $n_i \le \frac{n}{\mathrm{dis}(i-1)+1}$.

Thus, the total number M[Control-Minit] of messages in the worst case will be at most

$$M[\mathrm{Control\text{-}Minit}] \le 4 k \sum_{i=1}^{l} \mathrm{dis}(i) + n \sum_{i=l+1}^{\mathrm{dis}^{-1}(n)} \left( 3\, \frac{\mathrm{dis}(i)}{\mathrm{dis}(i-1)} + 1 \right) + n. \qquad (3.15)$$

3.3.4 Electoral Stages

In the previous protocol, we have introduced and used the idea of limiting the distances to control the complexity of the original “as far as it can” approach. This idea requires that an entity makes several successive attempts (at increasing distances) to become a leader. The idea of not making a single attempt to become a leader (as was done in All the Way and in AsFar), but of proceeding in stages instead, is a very powerful algorithmic tool of its own. It allows us to view the election as a sequence of electoral stages: At the beginning of each stage, the “candidates” run for election; at the end of the stage, some “candidates” will be defeated, and the others will start the next stage. Recall that “stage” is a logical notion, and it does not require the system to be synchronized; in fact, parts of the system may run very fast while other parts may be slow in their operation, so different entities might execute a stage at totally different times.

We will now see how the proper use of this tool allows us to achieve even better results, without controlling the distances and without return (or feedback) messages. To simplify the presentation and the discussion, we will temporarily assume that there is Message Ordering (i.e., the links are FIFO); we will remove the restriction immediately after.

As before, we will have each candidate send a message carrying its own id in both directions. Without setting an a priori fixed limit on the distance these messages can travel, we still would like to prevent them from traveling unnecessarily far (costing too many transmissions). The strategy to achieve this is simple and effective: A message will travel until it reaches another candidate in the same (or higher) stage.


The consequence of this simple strategy is that in each stage, a candidate will receive a message from each side; thus, it will know the ids of the neighboring candidate on each side. We will use this fact to decide whether a candidate x enters the next stage: x will survive this stage only if the two received ids are not smaller than its own id(x) (recall we are electing the entity with the smallest id); otherwise, it becomes defeated. As before, we will have defeated entities continue to participate by forwarding received messages.

Correctness and termination are easy to verify. Observe that the initiator with the smallest identity will never become defeated; by contrast, at each stage, its message will transform into defeated the neighboring candidate on each side (regardless of their distance). Hence, the number of candidates decreases at each stage. This means that eventually, the only candidate left is the one with the minimum id. When this happens, its messages will travel all along the ring (forwarded by the defeated entities) and reach it. Thus, a candidate receiving its own messages back knows that all other entities are defeated; it will then become leader and notify all other entities of termination.

Summarizing (see also Figure 3.13): A candidate x sends a message in both directions carrying its identity; these messages will travel until they encounter another candidate node. By symmetry, entity x will receive two messages, one from the “left” and one from the “right” (independently of any sense of direction); it will then become defeated if at least one of them carries an identity smaller than its own; if both the received identities are larger than its own, it starts the next stage; finally, if the received identities are its own, it becomes leader and notifies all entities of termination. A defeated node will forward any received election message, and each noninitiator will automatically become defeated upon receiving an election message.

The protocol is shown in Figure 3.14, where close and open denote the operation of closing a port (with the effect of enqueueing incoming messages) and opening a closed port (dequeueing the messages), respectively, and where procedure Initialize is shown in Figure 3.15.

FIGURE 3.13: A candidate x in an electoral stage, with neighboring candidates y and z: x > Min{y,z} => x defeated; x < Min{y,z} => x candidate in the next stage; x = Min{y,z} => x leader.


PROTOCOL Stages.

States: S = {ASLEEP, CANDIDATE, WAITING, DEFEATED, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ Ring.

ASLEEP
    Spontaneously
    begin
        INITIALIZE;
        become CANDIDATE;
    end

    Receiving("Election", id*, stage*)
    begin
        INITIALIZE;
        min := Min(id*, min);
        close(sender);
        become WAITING;
    end

CANDIDATE
    Receiving("Election", id*, stage*)
    begin
        if id* ≠ id(x) then
            min := Min(id*, min);
            close(sender);
            become WAITING;
        else
            send(Notify) to N(x);
            become LEADER;
        endif
    end

WAITING
    Receiving("Election", id*, stage*)
    begin
        open(other);
        stage := stage + 1;
        min := Min(id*, min);
        if min = id(x) then
            send("Election", id(x), stage) to N(x);
            become CANDIDATE;
        else
            become DEFEATED;
        endif
    end

DEFEATED
    Receiving(*)
    begin
        send(*) to other;
        if * = Notify then become FOLLOWER endif;
    end

FIGURE 3.14: Protocol Stages.

Procedure INITIALIZE
begin
    stage := 1;
    count := 0;
    min := id(x);
    send("Election", id(x), stage) to N(x);
end

FIGURE 3.15: Procedure Initialize used by protocol Stages.

Messages It is not so obvious that this strategy is more efficient than the previous one. Let us first determine the number of messages exchanged during a stage. Consider the segment of the ring between two neighboring candidates in stage i, x and y = r(i, x); in this stage, x will send a message to y and y will send one to x. No other messages will be transmitted during this stage in that segment. In other words, on each link, only two messages will be transmitted (one in each direction) in this stage. Therefore, in total, 2n message transmissions will be performed during each stage.

Let us determine now the number of stages. Consider a node x that is candidate at the beginning of stage i and is not defeated during this stage; let y = r(i, x) and z = l(i, x) be the first entities to the right and to the left of x, respectively, that are also candidates in stage i (Figure 3.16). It is not difficult to see that if x survives stage i, both r(i, x) and l(i, x) will be defeated. Therefore, at least half of the candidates are defeated at each stage. In other words, at most half of them survive:

$$n_i \le \frac{n_{i-1}}{2}.$$

As $n_1 = n$, the total number of stages is at most $\sigma_{Stages} \le \log n + 1$. Combining the two observations, we obtain

$$M[\mathrm{Stages}] \le 2\, n \log n + O(n). \qquad (3.16)$$

That is, protocol Stages outperforms protocol Control. Observe that equality is achievable in practice (Exercise 3.10.9). Further note that if we use the Minimum Initiator approach, the bound will become

$$M[\mathrm{Stages{:}Minit}] \le 2\, n \log k_* + O(n). \qquad (3.17)$$

FIGURE 3.16: If x survives this stage, its neighboring candidates will not.
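The counting argument for protocol Stages (2n transmissions per stage, at most half of the candidates surviving) can be replayed with a small simulation at the level of logical stages. This is a sketch under simplifying assumptions that are not in the text: every entity is an initiator, stages are executed in lockstep, and message delays are ignored. The function name stages_run is hypothetical.

```python
def stages_run(ids):
    """Logical-stage simulation of protocol Stages on a ring.

    ids lists the distinct ids in ring order. Returns a triple
    (leader id, number of stages, number of transmissions), where the
    transmissions include the n messages of the final notification.
    """
    n = len(ids)
    candidates = list(ids)      # surviving candidates, in ring order
    stages = 0
    messages = 0
    while True:
        stages += 1
        messages += 2 * n       # two "Election" messages cross every link
        m = len(candidates)
        if m == 1:
            # The lone candidate receives its own id back: it is the leader.
            return candidates[0], stages, messages + n
        survivors = []
        for j, x in enumerate(candidates):
            left = candidates[j - 1]
            right = candidates[(j + 1) % m]
            # x survives only if it is smaller than both neighboring candidates.
            if x < min(left, right):
                survivors.append(x)
        candidates = survivors
```

For instance, stages_run([1, 2, 3, 4, 5, 6, 7, 8]) returns (1, 2, 40): with ids in increasing order only the minimum survives the first stage, so the election ends after two stages. In every run the number of stages stays within log n + 1.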

Removing Message Ordering The correctness and termination of Stages are easy to follow also because we have assumed in our protocol that there is Message Ordering. This assumption ensured that the two messages received by a candidate in stage i are originated by candidates also in stage i. If we remove the Message Ordering restriction, it is possible that messages arrive out of order and that a message sent in stage j > i arrives before a message sent in stage i.

Simple Approach The simplest way to approach this problem is by enforcing the “effects” of Message Ordering, without really having it.

1. First of all, each message will also carry the stage number of the entity originating it.

2. When a candidate node x in stage i receives a message M∗ with stage j > i, it will not process it but will locally enqueue it until it has received from that side (and processed) all the messages from stages i, i + 1, . . . , j − 1, which have been “jumped over” by M∗; it will then process M∗.

The only modification to protocol Stages as described in Figure 3.14 is the addition of the local enqueueing of messages (Exercise 3.10.6); as this is only local processing, the message and time costs are unchanged.

Stages∗ An alternative approach is to keep track of a message “jumping over” others but without enqueueing it locally. We shall describe it in some detail and call Stages* the corresponding protocol.

1. First of all, we will give a stage number to all the nodes: For a candidate entity, it is the current stage; for a defeated entity, it is the stage in which it was defeated. We will then have a defeated node forward only messages from higher stages.

2. A candidate node x in stage i receiving an Election message M∗ with stage j > i will use the id included in the message, id*, and will make a decision about the outcome of stage i as if both of them were in the same stage.

• If x is defeated in this round, then it will forward the message M∗.

• If x survives, it means that id(x) is smaller not only than id* in M∗ but also than the ids in the messages “jumped over” by M∗ (Exercise 3.10.13). In this case, x can act as if it had already received from that side all the messages from stages i, i + 1, . . . , j, and they all had an id larger than id(x). We will indicate this fact by saying that x now has a credit of j − i messages on that port. In other words, if a candidate x has a credit c > 0 associated with a port, it does not have to wait for a message from that port during the current stage. Clearly, the credit must be decreased in each stage.

Writing the set of rules for protocol Stages* is a task that, although not difficult, requires great care and attention to details (Exercise 3.10.12); the same is true of the task of proving the correctness of protocol Stages* (Exercise 3.10.14). As for the resulting communication complexity, the number of messages is never more (sometimes less) than that with Message Ordering (Exercise 3.10.15).
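The local enqueueing of the Simple Approach can be pictured as a small per-port buffer that holds back messages from “future” stages. The class below is only a sketch of that idea, not the solution to Exercise 3.10.6: the name PortBuffer and its interface are hypothetical, and it assumes that on a given port a candidate receives exactly one message per stage, with stages numbered 1, 2, 3, and so on.

```python
class PortBuffer:
    """Per-port buffer that releases "Election" messages in stage order.

    Assumption (not from the text): exactly one message per stage arrives
    on this port, and stage numbers start from 1.
    """

    def __init__(self):
        self.next_stage = 1     # lowest stage not yet processed on this port
        self.pending = {}       # stage number -> id carried by a held message

    def receive(self, stage, msg_id):
        """Record an arrival; return the ids that may now be processed."""
        self.pending[stage] = msg_id
        ready = []
        # Release every message whose predecessors have all been processed.
        while self.next_stage in self.pending:
            ready.append(self.pending.pop(self.next_stage))
            self.next_stage += 1
        return ready
```

If the stage 2 message overtakes the stage 1 message, the first call returns an empty list and the second call returns both ids in stage order. Only local storage is involved, so no transmissions are added.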


Interestingly, if we attempt to measure the ideal time complexity, we will only see executions with Message Ordering. In other words, the phenomenon of messages delivered out of order will disappear. This is yet another case showing how biased and limited (and thus dangerous) ideal time is as a cost measure.

3.3.5 Stages with Feedback

We have seen how, with the proper use of electoral stages in protocol Stages, we can obtain an O(n log n) performance without the need to control the distance travelled by a message. In addition to controlled distances, protocol Control also uses a “feedback” technique: If a message successfully reaches its target, it returns back to its originator, providing it with a “positive feedback” on the situation it has encountered. Such a technique is missing in Stages: A message always successfully reaches its target (the next candidate in the direction it travels), which could be at an unpredictable distance; however, the use of the message ends there.

Let us integrate the positive feedback idea in the overall strategy of Stages: When an “Election” message reaches its target, a positive feedback will be sent back to its originator if the id contained in the message is the smallest seen by the target in this stage. More precisely, when a candidate x receives Election messages containing id(y) and id(z) from its neighboring candidates, y = r(i, x) and z = l(i, x), it will send a (positive) “feedback” message: to y if id(y) < Min{id(x), id(z)}, to z if id(z) < Min{id(x), id(y)}, and to none otherwise. A candidate will then survive this stage and enter the new one if and only if it receives a feedback from both sides.

In the example of Figure 3.17, candidates with ids 2, 5, and 8 will not send any feedback; of these three, only the candidate with id 2 will enter the next stage. The fate of the entity with id 7 depends on its other neighboring candidate, which is not shown; so, we do not know whether it will survive or not.
If a node sends a “feedback” message, it knows that it will not survive this stage. This is the case, for example, of the entities with ids 6, 9, 10, and 11. Some entities, however, do not send any “feedback” and wait for a “feedback” that will never arrive; this is, for example, the case of the entities with ids 5 and 8. How will such an entity discover that no “feedback” is forthcoming and that it must become defeated? The answer is fortunately simple. Every entity that survives stage i (e.g., the node with id 2) will start the next stage; its “Election” message will act as a negative feedback for those entities receiving the message while still waiting in stage i. More specifically, if, while waiting for a “feedback” message in stage i, an entity receives an “Election” message (clearly with a smaller id) in stage i + 1, it becomes defeated and forwards the message.

FIGURE 3.17: Only some candidates will send a feedback.

We shall call the protocol Stages with Feedback; our description assumed Message Ordering. As for protocol Stages, this restriction can and will be logically enforced with just local processing.

Correctness The correctness and termination of the protocol follow from the fact that the entity $x_{min}$ with the smallest identity will always receive a positive feedback from both sides; hence, it will never be defeated. At the same time, $x_{min}$ never sends a positive feedback; hence, its left and right neighboring candidates in that stage do not survive it. In other words, the number $n_i$ of candidates in stage i is monotonically decreasing, and eventually only $x_{min}$ will be in such a state. When this happens, its own “Election” messages will travel along the ring, and termination will be detected.

Messages We are adding bookkeeping and additional messages to the already highly efficient protocol Stages. Let us examine the effect of these changes. Let us start with the number of stages. As in Stages, if a candidate x in stage i survives, it is guaranteed that its neighboring candidates in the same stage, r(i, x) and l(i, x), will become defeated. With the introduction of positive feedback, we can actually guarantee that if x survives, neither the first candidate to the right of r(i, x) nor the first candidate to the left of l(i, x) will survive. This is because if x survives, it must have received a “feedback” from both r(i, x) and l(i, x) (see Figure 3.18).
But if $r(i, x)$ sends a “feedback” to $x$, it does not send one to its other neighboring candidate $r^2(i, x)$; similarly, $l(i, x)$ does not send a “feedback” to $l^2(i, x)$. In other words,

$$n_i \le \frac{n_{i-1}}{3}.$$

That is, at most one third of the candidates starting a stage will enter the next one. As $n_1 = n$, the total number of stages is at most $\sigma_{Stages} \le \log_3 n + 1$. Note that there are initial configurations of the ids that will force the protocol to have exactly this many stages (Exercise 3.10.22).
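The one-third reduction can be checked on a small ring with the following sketch. It is not the book's protocol code: it works at the level of whole stages, and it assumes the feedback rule that a candidate sends its single “feedback” to the neighboring candidate with the smaller id, and only if that id is smaller than its own. The ring order used here is also an assumption, chosen to be consistent with the ids mentioned in the discussion of Figure 3.17; the function names are hypothetical.

```python
def feedback_senders(vals):
    """Ids of the candidates that send a feedback in this stage.

    Assumed rule: a candidate sends one feedback, to the neighbouring
    candidate with the smaller id, and only if that id is smaller than its own.
    """
    k = len(vals)
    senders = []
    for j, own in enumerate(vals):
        left, right = vals[j - 1], vals[(j + 1) % k]
        if min(left, right) < own:
            senders.append(own)
    return senders


def survivors(vals):
    """Candidates that receive a feedback from both sides (len(vals) >= 3)."""
    k = len(vals)
    out = []
    for j, own in enumerate(vals):
        # The left neighbour l chooses `own` iff own < l and own < l's other neighbour.
        from_left = own < vals[j - 1] and own < vals[j - 2]
        # Symmetric condition for the right neighbour.
        from_right = own < vals[(j + 1) % k] and own < vals[(j + 2) % k]
        if from_left and from_right:
            out.append(own)
    return out


ring = [7, 9, 8, 10, 2, 6, 5, 11]      # assumed ring order
print(sorted(feedback_senders(ring)))  # [6, 9, 10, 11]
print(survivors(ring))                 # [2]
```

Under this rule a survivor must be smaller than the two candidates on each side, so two survivors are always separated by at least two defeated candidates, which is where the factor 3 comes from.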

FIGURE 3.18: If x survives, those other candidates do not.

In other words, the number of stages has decreased with the use of “feedback” messages. However, we are sending more messages in each stage. Let us now examine how many messages will be sent in each stage. Consider stage $i$; it is started by $n_i$ candidates. Each candidate sends an “Election” message that travels to the next candidate on either side. Thus, exactly as in Stages, two “Election” messages are sent over each link, one in each direction, for a total of $2n$ “Election” messages per stage. Consider now the “feedback” messages; a candidate sends at most one “feedback” and only in one direction. Thus, in the segment of the ring between two candidates, there will be at most one “feedback” message on each link; hence, there will be no more than $n$ “feedback” transmissions in total in each stage. This means that in each stage there will be at most $3n$ messages. Summarizing,

$$M[\text{StagesFeedback}] \le 3\, n \log_3 n + O(n) \le 1.89\, n \log n + O(n). \tag{3.18}$$

In other words, the use of feedback with the electoral stages allows us to reduce the number of messages in the worst case. The use of the Minimum Initiator strategy yields a similar result:

$$M[\text{StagesFeedback-Minit}] \le 1.89\, n \log k_* + O(n). \tag{3.19}$$

In the analysis of the number of “feedback” messages sent in each stage, we can be more accurate; in fact, there are some areas of the ring (composed of consecutive defeated entities between two successive candidates) where no feedback messages will be transmitted at all. In the example of Figure 3.17, this is the case of the area between the candidates with ids 8 and 10. The number of these areas is exactly equal to the number $n_{i+1}$ of candidates that survive this stage (Exercise 3.10.19). However, the savings are not enough to reduce the constant in the leading term of the message costs (Exercise 3.10.21).

Granularity of Analysis: Bit Complexity. The advantage of protocol Stages with Feedback becomes more evident when we look at communication costs at a finer level of granularity, focusing on the actual size of the messages being used. In fact, while the “Election” messages contain values, the “feedback” messages are just signals, each containing $O(1)$ bits. (Recall the discussion in Section 3.2.) In each stage, only the $2n$ “Election” messages carry a value, while the other $n$ are signals; hence, the total number of bits transmitted will be at most

$$2\, n\, (c + \log \mathrm{id}) \log_3 n + n\, c \log_3 n + \text{l.o.t.},$$

where id denotes the largest value sent in a message, $c = O(1)$ denotes the number of bits required to distinguish among the different types of message, and l.o.t. stands for “lower order terms.” That is,

$$B[\text{Stages with Feedback}] \le 1.26\, n \log n \, \log \mathrm{id} + \text{l.o.t.} \tag{3.20}$$
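The constants 1.89 and 1.26 come from changing the base of the logarithm from 3 to 2. The short sketch below recomputes them; the helper name leading_constant is hypothetical, and the arguments are the per-stage message count (as a multiple of n) and the guaranteed reduction rate.

```python
import math


def leading_constant(a, b):
    """Constant c such that a * n * log_b(n) = c * n * log2(n), i.e. c = a / log2(b)."""
    return a / math.log2(b)


# Stages: 2n messages per stage, candidates at least halved.
print(round(leading_constant(2, 2), 2))   # 2.0
# Stages with Feedback: 3n messages per stage, candidates cut to one third.
print(round(leading_constant(3, 3), 2))   # 1.89
# Stages with Feedback, counting only the 2n value-carrying messages per stage.
print(round(leading_constant(2, 3), 2))   # 1.26
```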


The improvement on the bit complexity of Stages, where every message carries a value, is, thus, in the reduction of the constant from 2 to 1.26.

Further Improvements? The use of electoral stages allows us to transform the election process into one of successive “eliminations,” reducing the number of candidates at each stage. In the original protocol Stages, each surviving candidate will eliminate its neighboring candidate on each side, guaranteeing that at least half of the candidates are eliminated in each stage. By using feedback, protocol Stages with Feedback extends the “reach” of a candidate also to the second neighboring candidate on each side, ensuring that at least two thirds of the candidates are eliminated in each stage. Increasing the “reach” of a candidate during a stage will result in a larger proportion of the candidates being eliminated in each stage, thus reducing the number of stages. So, intuitively, we would like a candidate to reach as far as possible during a stage. Obviously the price to be paid is the additional messages required to implement the longer reach.

In general, if we can construct a protocol that guarantees a reduction rate of at least $b$, that is, $n_i \le n_{i-1}/b$, then the total number of stages would be $\log_b n$; if the messages transmitted in each stage are at most $a\,n$, then the overall complexity will be

$$a\, n \log_b n = \frac{a}{\log b}\, n \log n.$$

To improve on Stages with Feedback, the reduction must be done with a number of messages such that $a / \log b < 1.89$. Whether this is possible or not is an open problem (Problem 3.10.3).

3.3.6 Alternating Steps

It should be clear by now that the road to improvement, on which creative ingenuity will travel, is oftentimes paved by a deeper understanding of what is already available. A way to achieve such an understanding is by examining the functioning of the object of our improvement in “slow motion,” so as to observe its details.

Let us consider protocol Stages. It is rather simple and highly efficient. We have already shown how to achieve improvements by extending the “reach” of a candidate during a stage; in a sense, this was really “speeding up” the functioning of the protocol. Let us now examine Stages instead by “slowing down” its functioning.

In each stage, a candidate sends its id in both directions, receives an id from each direction, and decides whether to survive, be elected, or become defeated on the basis of its own value and the received ones. Consider the example shown in Figure 3.19; the execution of a stage will result in candidates w, y, and v being eliminated and x and z surviving; the fate of u will depend on its right candidate neighbor, which is not shown.

We can obviously think of “sending in both directions” as two separate steps: send in one direction (say “right”) and send in the other. Assume for the moment that the ring is oriented: “right” has the same meaning for all entities. Thus, the stage can be thought of as having two steps: (1) the candidate sends to the “right” and receives from the “left”; (2) it will then send to the “left” and receive from the “right.”

FIGURE 3.19: Alternating Steps: slowing down the execution of Stages.

Consider the first step in the same example as shown in Figure 3.19; both candidates y and v already know at this time that they would not survive. Let us take advantage of this “early” discovery. We will use each of these two steps to make an electoral decision, and we will eliminate a candidate after step (1) if it receives a smaller id in this step. Thus, a candidate will perform step (2) only if it is not eliminated in step (1). The advantage of doing so becomes clear when observing that, by eliminating candidates in each step of a stage, we eliminate more candidates than in the original stage; in the example of Figure 3.19, x will also be eliminated.

Summarizing, the idea is that at each step, a candidate sends only one message with its value, waits for one message, and decides on the basis of its value and the received one; the key is to alternate at each step the direction in which messages are sent. This protocol, which we shall call Alternate, is shown in Figure 3.20, where close and open denote the operation of closing a port (with the effect of enqueueing incoming messages) and opening a closed port (dequeueing the messages), respectively; the procedures Initialize and Process Message are shown in Figure 3.21.

Correctness. The correctness of the protocol follows immediately from observing that, as usual, the candidate $x_{min}$ with the smallest value will never be eliminated and that, on the contrary, it will in each step eliminate a neighboring candidate. Hence, the number of candidates is monotonically decreasing in the steps; when only $x_{min}$ is left, its message will complete the tour of the ring, transforming it into the leader. The final notification will ensure proper termination of all entities.

Costs. To determine the cost is slightly more complex. There are exactly $n$ messages transmitted in each step, so we need to determine the total number of steps $\sigma_{Alternate}$ (or, where no confusion arises, simply $\sigma$) until a single candidate is left, in the worst case, regardless of the placement of the ids in the ring, time delays, and so forth. Let $n_i$ be the number of candidate entities starting step $i$; clearly $n_1 = n$ and $n_\sigma = 1$. We know that two successive steps of Alternate will eliminate more candidates than a single stage of Stages; hence, the total number of steps will be less than twice the number of stages of Stages: $\sigma < 2 \log n$. We can, however, be more accurate regarding the amount of elimination performed in two successive steps.


PROTOCOL Alternate.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER};
        S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ OrientedRing ∪ MessageOrdering.

ASLEEP
   Spontaneously
   begin
      INITIALIZE;
      become CANDIDATE;
   end

   Receiving("Election", id*, step*)
   begin
      INITIALIZE;
      become CANDIDATE;
      PROCESS MESSAGE;
   end

CANDIDATE
   Receiving("Election", id*, step*)
   begin
      if id* ≠ id(x) then
         PROCESS MESSAGE;
      else
         send(Notify) to N(x);
         become LEADER;
      endif
   end

DEFEATED
   Receiving(*)
   begin
      send(*) to other;
      if * = Notify then become FOLLOWER endif;
   end

FIGURE 3.20: Protocol Alternate.

Assume that in step $i$, the direction is “right” (thus, it will be “left” in step $i + 1$). Let $d_i$ denote the number of candidates that are eliminated in step $i$. Of the $n_i$ candidates that start step $i$, $d_i$ will be defeated and only $n_{i+1}$ will survive that step. That is,

$$n_i = d_i + n_{i+1}.$$

Consider a candidate $x$ that survives both step $i$ and step $i + 1$. First of all observe that the candidate to the right of $x$ in step $i$ will be eliminated in that step. (If not, it would mean that its id is smaller than id($x$) and thus it would eliminate $x$ in step $i + 1$; but we know that $x$ survives.) This means that every candidate that, like $x$, survives both steps will eliminate one candidate in the first of the two steps; in other words,

$$d_i \ge n_{i+2},$$

but then

$$n_i \ge n_{i+1} + n_{i+2}. \tag{3.21}$$

Procedure INITIALIZE
begin
   step := 1;
   min := id(x);
   send("Election", id(x), step) to right;
   close(right);
end

Procedure PROCESS MESSAGE
begin
   if id* < min then
      open(other);
      become DEFEATED;
   else
      step := step + 1;
      send("Election", id(x), step) to sender;
      close(sender);
      open(other);
   endif
end

FIGURE 3.21: Procedures used by protocol Alternate.

The consequence of this fact is very interesting. In fact, we know that $n_\sigma = 1$ and, obviously, $n_{\sigma-1} \ge 2$. From Equation 3.21, we have

$$n_{\sigma-i} \ge n_{\sigma-i+1} + n_{\sigma-i+2}.$$

Consider now the Fibonacci numbers $F_j$ defined by $F_{j+2} = F_{j+1} + F_j$, where $F_{-1} = 0$ and $F_0 = 1$. Then, clearly, $n_{\sigma-i} \ge F_{i+1}$. It follows that $n_1 \ge F_\sigma$; but $n_1 = n$, so $\sigma$ is at most the index of the largest Fibonacci number not exceeding $n$. This helps us in achieving our goal of determining $\sigma$, the number of steps until there is only one candidate left. As $F_j \ge b \left(\frac{1+\sqrt{5}}{2}\right)^j$, where $b$ is a positive constant, we have

$$n \ge F_\sigma \ge b \left(\frac{1+\sqrt{5}}{2}\right)^{\sigma},$$

from which we get

$$\sigma_{Alternate} \le 1.44 \log n + O(1).$$

That means that after at most so many steps, there will be only one candidate left. Observe that what we have derived is actually achievable. In fact, there are allocations of the ids to the nodes of a ring that will force the protocol to perform $\sigma_{Alternate}$ steps before there is only one candidate left (Exercise 3.10.26). In the next step, this candidate will become leader and start the notification. These last two operations require $n$ messages each. Thus the total number of messages will be

$$M[\text{Alternate}] \le 1.44\, n \log n + O(n). \tag{3.22}$$
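The elimination process behind Equation 3.21 can be observed with a small step-level sketch. It is a simplification, not the message-passing protocol of Figures 3.20 and 3.21: all candidates are assumed to move in lockstep, every entity is assumed to be an initiator, and the ring order (the six ids of Figure 3.19 treated as a complete ring) is an assumption made for illustration. The function name is hypothetical.

```python
def alternate_steps(ids):
    """Return [n_1, n_2, ..., 1], the number of candidates starting each step.

    ids lists distinct ids in ring order; moving to higher indices is "right".
    """
    cands = list(ids)
    counts = [len(cands)]
    step = 1
    while len(cands) > 1:
        k = len(cands)
        if step % 2 == 1:
            # "right" step: each candidate receives the id of the candidate on its left
            cands = [c for j, c in enumerate(cands) if cands[j - 1] > c]
        else:
            # "left" step: each candidate receives the id of the candidate on its right
            cands = [c for j, c in enumerate(cands) if cands[(j + 1) % k] > c]
        counts.append(len(cands))
        step += 1
    return counts


counts = alternate_steps([8, 7, 9, 3, 10, 6])
print(counts)  # [6, 3, 2, 1]
# Equation 3.21: n_i >= n_{i+1} + n_{i+2}
print(all(counts[i] >= counts[i + 1] + counts[i + 2] for i in range(len(counts) - 2)))  # True
```

In this run the candidate with id 7 survives the first step and is eliminated in the second, as described for x in the discussion of Figure 3.19.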

In other words, protocol Alternate is not only simple but also more efficient than all other protocols seen so far. Recall, however, that it has been described and analyzed assuming that the ring is oriented.

Question. What happens if the ring is not oriented?

If the entities have different meanings for “right,” then, when implementing the first step, some candidates will send messages in a clockwise direction while others will send them in a counterclockwise direction. Notice that in the implementation for oriented rings described above, this would lead to deadlock because we close the port we are not waiting to receive from; the implementation can be modified so that the ports are never closed (Exercise 3.10.24). Consider this to be the case. It will then happen that a candidate waiting to receive from “left” will instead receive from “right.” Call this situation a conflict. What we need to do is to add to the protocol a conflict resolution mechanism to cope with such situations. Clearly this complicates the protocol (Problem 3.10.2).

3.3.7 Unidirectional Protocols

The first two protocols we have examined, All the Way and AsFar, did not really require the restriction Bidirectional Links; in fact, they can be used without any modification in a directed or a unidirectional ring. The subsequent protocols Distances, Stages, Stages with Feedback, and Alternate all used the communication links in both directions, for example, for obtaining feedback. It was through them that we have been able to reduce the costs from $O(n^2)$ to a guaranteed $O(n \log n)$ messages. The immediate and natural question is as follows:

Question. Is “Bidirectional Links” necessary for an $O(n \log n)$ cost?

The question is practically relevant because if the answer is positive, it would indicate that an additional investment in communication hardware (i.e., full duplex lines) is necessary to reduce the operating costs of the election task. The answer is important also from a theoretical point of view because, if positive, it would clearly indicate the “power” of the restriction Bidirectional Links. Not surprisingly, this question has attracted the attention of many researchers. We are going to see now that the answer is actually No.


We are also going to see that, strangely enough, we know how to do better with unidirectional links than with bidirectional ones. First of all, we are going to show how the execution of protocols Stages and Alternate can be simulated in unidirectional links yielding the same (if not better) complexity. Then, using the lessons learned in this process, we are going to develop a more efficient unidirectional solution.

Unidirectional Stages. What we are going to do is to show how to simulate the execution of protocol Stages in unidirectional rings $\vec{R}$, with the same message costs. Consider how protocol Stages works. In a stage, a candidate entity x
1. sends a message carrying a value (its id) in both directions and thus receives a message with the value (the id) of another candidate from each direction, and then,
2. on the basis of these three values (i.e., its own and the two received ones), makes a decision on whether it (and its value) should survive this stage and start the next stage.

Let us implement each of these two steps separately. Step (1) is clearly the difficult one because, in a unidirectional ring, messages can only be sent in one direction. Decompose the operation “send in both directions” into two substeps: (I) “send in one direction” and then (II) “send in the other direction.”

Substep (I) can be executed directly in $\vec{R}$; as a result, every candidate will receive a message with the value of its neighboring candidate from the opposite direction (see Figure 3.22(c)). The problem is now in implementing substep (II); as we cannot send information in the other direction, we will send information again in the same direction, and, as it is meaningless to send the same information again, we will send the information we just received. As a result, every candidate will now receive the value of another candidate from the opposite direction (see Figure 3.22(d)).

Every entity in $\vec{R}$ has now three values at its disposal: the one it started with plus the two received ones. We can now proceed to implement Step (2). To simulate the bidirectional execution, we need a candidate to decide whether to survive or to become passive on the basis of exactly the same information in $\vec{R}$ as in the bidirectional case. Consider the initial configuration in the example shown in Figure 3.22 and focus on the candidate x with starting value 7; in the bidirectional case, x decides that the value 7 should survive on the basis of the information: 7, 15, and 8. In the unidirectional case, after the implementation of Step (1), x now knows 4 and 15 in addition to 7. This is not the same information at all. In fact, it would lead to totally different decisions in the two cases, destroying the simulation.

There is, however, in $\vec{R}$ a candidate that, at the end of Step (1), has exactly the same information that x has at the end of Step (1) in the bidirectional case: this is the candidate that started with value 8. In fact, the information available in $R$ exists in $\vec{R}$ (compare carefully Figures 3.22(b) and (d)), but it is shifted to the “next” candidate in the ring direction. It is, thus, possible to make the same decisions in $\vec{R}$ as in $R$; they will just have to be made by different entities in the two cases.

FIGURE 3.22: (a) Initial configuration; (b) information after the first full stage of Stages with Bidirectional Links; (c) information after the first substep in the unidirectional simulation; (d) information after the second substep.

In each stage, a candidate makes a decision on a value. In protocol Stages, this value was always the candidate’s id. In the unidirectional algorithm, this value is not the id; it is the first value sent by its neighboring candidate in Step (1). We will call this value the envelope.

IMPORTANT. Be aware that unless we add the assumption Message Ordering, it is possible that the second value arrives before the envelope. This problem can be solved (e.g., by locally enqueueing out-of-order messages).

It is not difficult to verify that the simulation is exact: in each stage, exactly the same values survive in $\vec{R}$ as in $R$; thus, the number of stages is exactly the same.


PROTOCOL UniStages.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER};
        S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ UnidirectionalRing.

ASLEEP
   Spontaneously
   begin
      INITIALIZE;
      become CANDIDATE;
   end

   Receiving("Election", value*, stage*, order*)
   begin
      send("Election", value*, stage*, order*);
      become DEFEATED;
   end

CANDIDATE
   Receiving("Election", value*, stage*, order*)
   begin
      if value* ≠ value1 then
         PROCESS MESSAGE;
      else
         send(Notify);
         become LEADER;
      endif
   end

DEFEATED
   Receiving(*)
   begin
      send(*);
      if * = Notify then become FOLLOWER endif;
   end

FIGURE 3.23: Protocol UniStages.

The cost of each stage is also the same: $2n$ messages. In fact, each node will send (or forward) exactly two messages. In other words,

$$M[\text{UniStages}] \le 2\, n \log n + O(n). \tag{3.23}$$
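The claim that the same values survive in both settings can be checked with a value-level sketch of a single stage. This is an illustration under stated assumptions, not the protocol of Figure 3.23: candidates move in lockstep, all entities are candidates, and the ring order is reconstructed from the discussion of Figure 3.22 (7 receives 15 and then 4; 8 holds 8, 7, and 15). The function names are hypothetical.

```python
def stages_round(vals):
    """One stage of bidirectional Stages: a value survives iff it is smaller
    than the values of both neighbouring candidates."""
    k = len(vals)
    return [v for j, v in enumerate(vals)
            if v < min(vals[j - 1], vals[(j + 1) % k])]


def unistages_round(vals):
    """One stage of the unidirectional simulation (messages travel towards
    higher indices). Each candidate decides on the envelope, i.e. the first
    value it receives, and plays for it in the next stage if it survives."""
    out = []
    for j, value1 in enumerate(vals):
        envelope = vals[j - 1]   # first message: value of the previous candidate
        value2 = vals[j - 2]     # second message: the value that candidate relayed
        if envelope < min(value1, value2):
            out.append(envelope)
    return out


ring = [4, 15, 7, 8, 5, 11, 9, 12]   # assumed ring order
print(stages_round(ring))            # [4, 7, 5, 9]
print(unistages_round(ring))         # [4, 7, 5, 9]
```

The surviving values coincide, but in the unidirectional case each one is now held by the next candidate in the ring direction; for instance, the value 7 is carried forward by the candidate that started with 8.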

This shows that $O(n \log n)$ guaranteed message costs can be achieved in ring networks also without Bidirectional Links. The corresponding protocol UniStages is shown in Figure 3.23, described not as a unidirectional simulation of Stages (which indeed it is) but directly as a unidirectional protocol.

NOTES. In this implementation,
1. we elect a leader only among the initiators (using approach Minimum Initiator);
2. Message Ordering is not assumed; within a stage, we use a Boolean variable, order, to distinguish between value and envelope; to cope with messages from different stages arriving out of order, a candidate that receives a message from the “future” (i.e., with a higher stage number) is immediately transformed into defeated and forwards the message.

Procedure INITIALIZE
begin
   stage := 1;
   count := 0;
   order := 0;
   value1 := id(x);
   send("Election", value1, stage, order);
end

Procedure PROCESS MESSAGE
begin
   if stage* = stage then
      if order* = 0 then
         envelope := value*;
         order := 1;
         send("Election", value*, stage*, order);
      else
         value2 := value*;
      endif
      count := count + 1;
      if count = 2 then
         if envelope < Min(value1, value2) then
            order := 0;
            count := 0;
            stage := stage + 1;
            value1 := envelope;
            send("Election", value1, stage, order);
         else
            become DEFEATED;
         endif
      endif
   else
      if stage* > stage then
         send("Election", value*, stage*, order*);
         become DEFEATED;
      endif
   endif
end

FIGURE 3.24: Procedures used by protocol UniStages.

Unidirectional Alternate. We have shown how to simulate Stages in a unidirectional ring, achieving exactly the same cost. Let us focus now on Alternate; this protocol makes full explicit use of the full duplex communication capabilities of the bidirectional ring by alternating direction at each step. Surprisingly, it is possible to achieve an exact simulation also of this protocol in a unidirectional ring $\vec{R}$. Consider how protocol Alternate works. In a “left” step,
1. a candidate entity x sends a message carrying a value v(x) to the “left” and receives a message with the value of another candidate from the “right”;
2. on the basis of these two values (i.e., its own and the received one), x makes a decision on whether it (and its value) should survive this step and start the next step.

The actions in a “right” step are the same except that “left” and “right” are interchanged.

FIGURE 3.25: (a-b) Information after (a) the first step and (b) the second step of Alternate in an oriented bidirectional ring. (c-d) Information after (c) the first step and (d) the second step of the unidirectional simulation.

Consider the ring $\vec{R}$ shown in Figure 3.25, and assume we can send messages only to the “right.” This means that the initial “right” step can be trivially implemented: every entity will send a value (its own) and receive another; it starts the next step if and only if the value it receives is not smaller than its own.


Let us concentrate on the “left” step. As a candidate cannot send a value to the left, it will have to send the value to the “right.” Let us do so. Every candidate in $\vec{R}$ has now two values at its disposal: the one it started with and the received one. To simulate the bidirectional execution, we need a candidate to decide whether to survive or to become passive on the basis of exactly the same information in $\vec{R}$ as in the bidirectional case.

Consider the initial configuration in the example shown in Figure 3.25. First of all observe that the information in the “right” step is the same both in the bidirectional (a) and in the unidirectional (c) case. The differences occur in the “left” step. Focus on the candidate x with starting value 7; in the second step of the bidirectional case, x decides that the value 7 should not survive on the basis of the information: 5 and 7. In the unidirectional case, after the second step, x now knows 7 and 8. This is not the same information at all. In fact, it would lead to totally different decisions in the two cases, destroying the simulation.

There is, however, in $\vec{R}$ a candidate that, at the end of the second step, has exactly the same information that x has in the bidirectional case: this is the candidate that started with value 5. As we have seen already in the simulation of Stages, the information available in $R$ exists in $\vec{R}$ (compare carefully Figures 3.25(b) and (d)). It is, thus, possible to make the same decisions in $\vec{R}$ as in $R$; they will just have to be made by different entities in the two cases.

Summarizing, in each step, a candidate makes a decision on a value. In protocol Alternate, this value was always the candidate’s id. In the unidirectional algorithm, this value changes depending on the step. Initially, it is its own value; in the “left” step, it is the value it receives; in the “right” step, it is the value it already has. In other words,
1. in the “right” step, a candidate x survives if and only if the received value is larger than v(x);
2. in the “left” step, a candidate x survives if and only if the received value is smaller than v(x), and if so, x will now play for that value.

Working out a complete example will help clarify the simulation process and dispel any confusion (Exercise 3.10.33).

IMPORTANT. Be aware that unless we add the assumption Message Ordering, it is possible that the value from step $i + 1$ arrives before the value for step $i$.

It is not difficult to verify that the simulation is exact: in each step, exactly the same values survive in $\vec{R}$ as in $R$; thus, the number of steps is exactly the same. The cost of each step is also the same: $n$ messages. Thus,

$$M[\text{UniAlternate}] \le 1.44\, n \log n + O(n). \tag{3.24}$$
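The two survival rules above can be compared with the bidirectional ones in a small value-level sketch. It assumes lockstep execution with every entity a candidate, and the ring used is a hypothetical example (it is not claimed to be the ring of Figure 3.25). All function names are illustrative.

```python
def bi_step(vals, direction):
    """One step of Alternate on an oriented bidirectional ring."""
    k = len(vals)
    if direction == "right":   # ids travel right: compare with the left neighbour
        return [v for j, v in enumerate(vals) if vals[j - 1] > v]
    return [v for j, v in enumerate(vals) if vals[(j + 1) % k] > v]


def uni_step(vals, direction):
    """One step of the unidirectional simulation; messages always travel right."""
    out = []
    for j, own in enumerate(vals):
        received = vals[j - 1]
        if direction == "right":
            if received > own:
                out.append(own)        # survive and keep the current value
        elif received < own:
            out.append(received)       # survive and now play for the received value
    return out


def run(step_fn, vals):
    """Sorted surviving values after each step, starting with a "right" step."""
    history, direction = [], "right"
    while len(vals) > 1:
        vals = step_fn(vals, direction)
        history.append(sorted(vals))
        direction = "left" if direction == "right" else "right"
    return history


ring = [8, 7, 9, 3, 10, 6]   # hypothetical ring
print(run(bi_step, ring))    # [[3, 6, 7], [3, 6], [3]]
print(run(uni_step, ring))   # [[3, 6, 7], [3, 6], [3]]
```

The histories are compared as sorted lists because, after a “left” step, each surviving value sits one candidate further along the ring than in the bidirectional execution.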

The unidirectional simulation of Alternate is shown in Figure 3.26; it has been simplified so that we elect a leader only among the initiators, and assuming Message Ordering. The protocol can be modified to remove this assumption without changes in its cost (Exercise 3.10.34). The procedures Initialize and Process Message are shown in Figure 3.27.

PROTOCOL UniAlternate.

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER};
        S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ UnidirectionalRing ∪ MessageOrdering.

ASLEEP
   Spontaneously
   begin
      INITIALIZE;
      become CANDIDATE;
   end

   Receiving("Election", value*, step*, direction*)
   begin
      send("Election", value*, step*, direction*);
      become DEFEATED;
   end

CANDIDATE
   Receiving("Election", value*, step*, direction*)
   begin
      if value* ≠ value then
         PROCESS MESSAGE;
      else
         send(Notify);
         become LEADER;
      endif
   end

DEFEATED
   Receiving(*)
   begin
      send(*);
      if * = Notify then become FOLLOWER endif;
   end

FIGURE 3.26: Protocol UniAlternate.

Procedure INITIALIZE
begin
   step := 1;
   direction := "right";
   value := id(x);
   send("Election", value, step, direction);
end

Procedure PROCESS MESSAGE
begin
   if direction = "right" then
      if value < value* then
         step := step + 1;
         direction := "left";
         send("Election", value, step, direction);
      else
         become DEFEATED;
      endif
   else
      if value > value* then
         step := step + 1;
         direction := "right";
         value := value*;     (* x now plays for the received value *)
         send("Election", value, step, direction);
      else
         become DEFEATED;
      endif
   endif
end

FIGURE 3.27: Procedures used by protocol UniAlternate.

An Alternative Approach. In all the solutions we have seen so far, both for unidirectional and bidirectional rings, we have used the same basic strategy of minimum finding; in fact, in all of the protocols so far, we have elected as a leader the entity with the smallest value (either among all the entities or among just the initiators). Obviously, we could have used maximum finding in those solution protocols, just substituting the function Min with Max and obtaining the exact same performance.

A very different approach consists in mixing these two strategies. More precisely, consider the protocols based on electoral stages. In all of them, what we could do is to alternate strategy in each stage: in “odd” stages we use the function Min, and in “even” stages we use the function Max. Call this approach min-max.

It is not difficult to verify that all the stage-based protocols we have seen so far, both bidirectional and unidirectional, still correctly solve the election problem; moreover, they do so with the same costs as before (Exercises 3.10.11, 3.10.23, 3.10.28, 3.10.31, 3.10.36). The interesting and surprising thing is that this approach can lead to the design of a more efficient protocol for unidirectional rings.

The protocol we will construct has a simple structure. Let us assume that every entity starts and that there is Message Ordering (we will remove both assumptions later).
1. Each initiator x becomes candidate, prepares a message containing its own value id(x) and the stage number i = 1, and sends it (recall, we are in a unidirectional ring, so there is only one out-neighbor); x is called the originator of this message and remembers its content.
2. When a message with value b arrives at a candidate y, y compares the received value b with the value a it sent in its last message.
   (a) If a = b, the message originated by y has made a full trip around the ring; y becomes the leader and notifies all other entities of termination.
   (b) If a ≠ b, the action y will take depends on the stage number j:
       (i) if j is “even,” the message is discarded if and only if a > b (i.e., b survives only if it is the max);
       (ii) if j is “odd,” the message is discarded if and only if a < b (i.e., b survives only if it is the min).
       If the message is discarded, y becomes defeated; otherwise, y will enter the next stage: it originates a message with content (b, j + 1) and sends it.
3. A defeated entity will, as usual, forward received messages.

For example, see Figure 3.28.

FIGURE 3.28: Protocol MinMax: (a) In an even stage, a candidate survives only if it receives an envelope with a larger value; (b) it then generates an envelope with that value and starts the next stage; (c) in an odd stage, a candidate survives only if it receives an envelope with a smaller value; if so, it generates an envelope with that value and starts the next stage.

The correctness of the protocol follows from observing that (a) in an even stage $i$, the candidate x receiving the largest of all values in that stage, $v_{max}(i)$, will survive and enter the next stage; by contrast, its “predecessor” $l(i, x)$ that originated that message will become defeated (Exercise 3.10.37), and (b) in an odd stage $j$, the candidate y receiving the smallest of all values in that stage, $v_{min}(j)$, will survive and enter the next stage; furthermore, its “predecessor” $l(j, y)$ that originated that message will become defeated. In other words, in each stage at least one candidate will survive that stage, and the number of candidates in a stage is monotonically decreasing with the number of stages. Thus, within finite time, there will be only one candidate left; when that happens, its message returns to it, transforming it into a leader.


IMPORTANT. Note that the entity that will be elected leader is, in general, neither the one with the smallest value nor the one with the largest value.

Let us now consider the costs of this protocol, which we will call MinMax. In a stage, each candidate sends a message that travels to the next candidate. In other words, in each stage there will be exactly n messages. Thus, to determine the total number of messages, we need to compute the number $\sigma_{MinMax}$ of stages.

We can rephrase the protocol in terms of values instead of entities. Each value sent in a stage j travels from its originator to the next candidate in stage j. Of all these values, only some will survive and will be sent in the next stage: In an even stage, a value survives if it is larger than its “successor” (i.e., the next value in the ring also in this stage); similarly, in an odd stage, it survives if it is smaller than its successor. Let $n_i$ be the number of values in stage i; of those, $d_i$ will be discarded and $n_{i+1}$ will be sent in the next stage. That is, $n_{i+1} = n_i - d_i$.

Let i be an odd (i.e., min) stage, and let value v survive this stage; this means that the successor of v in stage i, say u, is larger than v, that is, u > v. Let v survive also stage i + 1 (an even, i.e., max, stage). This implies that u must have been discarded in stage i: If not, the entity that originates the message (u, i + 1) would discard (v, i + 1) because u > v, but we know that v survives this stage. This means that every value that, like v, survives both stages will eliminate one value in the first of the two stages; in other words, $n_{i+2} \le d_i$. As $n_i = n_{i+1} + d_i$, it follows that

$$n_i \ge n_{i+1} + n_{i+2}. \qquad (3.25)$$

Notice that this is exactly the same recurrence as the one (Equation 3.21) we derived for protocol Alternate. We thus obtain that $\sigma_{MinMax} \le 1.44 \log n + O(1)$. After at most this many stages, there will be only one value left. Observe that the bound we have derived is actually achievable: There are allocations of the ids to the nodes of a ring that will force the protocol to perform $\sigma_{MinMax}$ stages before there is only one value left (Exercise 3.10.38). The candidate sending this value will receive its message back and become leader; it will then start the notification. These last two steps require n messages each; thus the total number of messages will be

$$M[MinMax] \le 1.44\, n \log n + O(n). \qquad (3.26)$$


PROTOCOL MinMax

States: S = {ASLEEP, CANDIDATE, DEFEATED, FOLLOWER, LEADER}; S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.

Restrictions: IR ∪ UnidirectionalRing ∪ MessageOrdering.

ASLEEP
    Spontaneously
    begin
        stage := 1; value := id(x);
        send("Envelope", value, stage);
        become CANDIDATE;
    end

    Receiving("Envelope", value*, stage*)
    begin
        send("Envelope", value*, stage*);
        become DEFEATED;
    end

CANDIDATE
    Receiving("Envelope", value*, stage*)
    begin
        if value* ≠ value then
            PROCESS ENVELOPE;
        else
            send("Notify");
            become LEADER;
        endif
    end

DEFEATED
    Receiving("Envelope", value*, stage*)
    begin
        send("Envelope", value*, stage*);
    end

    Receiving("Notify")
    begin
        send("Notify");
        become FOLLOWER;
    end

FIGURE 3.29: Protocol MinMax.

In other words, we have been able to obtain the same costs of UniAlternate with a very different protocol, MinMax, described in Figure 3.29. We have assumed that all entities start. When removing this assumption we have two options: The entities that are not initiators can be (i) made to start (as if they were initiators) upon receiving their ﬁrst message or (ii) transformed into passive and just act as relayers. The second option is the one used in Figure 3.29. We have also assumed Message Ordering in our discussion. As with all the other protocols we have considered, this restriction can be enforced with just local bookkeeping at each entity, without any increase in complexity (Exercise 3.10.39).


Procedure PROCESS ENVELOPE
begin
    if odd(stage*) then
        if value* < value then
            stage := stage + 1; value := value*;
            send("Envelope", value, stage);
        else
            become DEFEATED;
        endif
    else
        if value* > value then
            stage := stage + 1; value := value*;
            send("Envelope", value, stage);
        else
            become DEFEATED;
        endif
    endif
end

FIGURE 3.30: Procedure Process Envelope of Protocol MinMax.
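The behavior of MinMax can be explored with a small simulation. The following Python sketch is not the book's code: it assumes that all entities start spontaneously (so the ASLEEP relayer case never occurs), models each link as a FIFO queue to respect Message Ordering, and lets the leader silently absorb the returning "Notify" message. The function name minmax_election and its return triple are illustrative choices.

```python
from collections import deque


def minmax_election(ids):
    """Simulate MinMax on a unidirectional ring; ids[i] is the id of entity i.

    Assumptions of this sketch: every entity is an initiator, links are FIFO,
    and ids are distinct. Returns (leader index, messages sent, leader's stage).
    """
    n = len(ids)
    state = ["CANDIDATE"] * n
    value = list(ids)
    stage = [1] * n
    inbox = [deque() for _ in range(n)]  # inbox[i] models the link into entity i
    sent = 0

    def send(i, msg):
        nonlocal sent
        inbox[(i + 1) % n].append(msg)
        sent += 1

    # Stage 1: every entity originates an envelope with its own id.
    for i in range(n):
        send(i, ("Envelope", value[i], 1))

    while any(inbox):
        for i in range(n):
            if not inbox[i]:
                continue
            msg = inbox[i].popleft()
            if msg[0] == "Notify":
                if state[i] == "DEFEATED":
                    send(i, msg)
                    state[i] = "FOLLOWER"
                continue  # the leader absorbs its own notification
            _, v, s = msg
            if state[i] == "DEFEATED":
                send(i, msg)  # defeated entities only relay envelopes
            elif v == value[i]:
                state[i] = "LEADER"  # own envelope came back
                send(i, ("Notify",))
            else:
                # Odd stages keep the smaller value, even stages the larger one.
                survives = v < value[i] if s % 2 == 1 else v > value[i]
                if survives:
                    stage[i] = s + 1
                    value[i] = v
                    send(i, ("Envelope", v, stage[i]))
                else:
                    state[i] = "DEFEATED"

    leader = state.index("LEADER")
    return leader, sent, stage[leader]


if __name__ == "__main__":
    print(minmax_election([3, 1, 2]))
```

Under these assumptions every stage, including the last one in which the sole surviving envelope travels around the whole ring, costs exactly n envelope transmissions, and the notification costs n more, so the count returned equals n times (stages + 1).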

Hacking: Employing the Defeated

The different approach used in protocol MinMax has led to a different way of obtaining the same efficiency as we already had with UniAlternate. The advantage of MinMax is that it is possible to obtain additional improvements that lead to significantly better performance. Observe that, as in most previous protocols, the defeated entities play a purely passive role; that is, they just forward messages. The key observation we will use to obtain an improvement in performance is that these entities can be exploited in the computation.

Let us concentrate on the even stages and see if we can obtain some savings for those stages. The message sent by a candidate travels (forwarded by the defeated entities) until it encounters the next candidate. This distance can vary and can be very large. What we will do is control the maximum distance to which the message will travel, following the idea we developed in Section 3.3.3.

(I) In an even stage j, a message will travel no more than a predefined distance dis(j). This is implemented by having in the message a counter (initially set to dis(j)) that is decreased by one by each defeated node it passes. The appropriate choice of dis(j) will be discussed later.

Every change we make in the protocol has strong consequences. As a consequence of (I), the message from x might not reach the next candidate y if it is too far away (more than dis(j)) (see Figure 3.31). In this case, the candidate y does not receive the message in this stage and, thus, does not know what to do for the next stage.

IMPORTANT. It is possible that every candidate is too far away from the next one in this stage, and hence none of them will receive a message.


FIGURE 3.31: Protocol MinMax+. Controlling the distance: In even stage j , the message does not travel more than dis(j ) nodes. (a) If it does not reach the next candidate y, the defeated node reached last, z, will become candidate and start the next step; (b) in the next step, the message from z transforms into defeated the entity y still waiting for the stage j message.

However, if candidate y does not receive the message from x, it is because the counter of the message containing (v, j) reaches 0 at a defeated node z, on the way from x to y (see Figure 3.31). To ensure progress (i.e., absence of deadlock), we will make the defeated node z become candidate and start the next stage j + 1 immediately, sending (v, j + 1). That is,

(II) in an even stage j, if the counter of the message reaches 0 at a defeated node z, then z becomes candidate and starts stage j + 1 with value = v*, where v* is the value in the message.

In other words, we are bringing some defeated nodes back into the game, making them candidates again. This operation could be dangerous for the complexity of the protocol, as the number of candidates appears to be increasing (and not decreasing). This is easily taken care of: The candidates, like y, that are waiting for a message that will not arrive will become defeated.

Question. How will y know that it is defeated? The answer is simple. The candidate that starts the next stage (e.g., z in our example) sends a message; when this message reaches a candidate (e.g., y) still waiting for a message from the previous stage, that entity will understand, become defeated, and forward the message. In other words,

(III) when, in an even stage, a candidate receives a message for the next stage, it becomes defeated and forwards the message.

We are giving decisional power to the defeated nodes, even bringing some of them back to “life.” Let us push this concept forward and see if we can obtain some other savings. Let us concentrate on the odd stages.


Consider an even stage i in MinMax (e.g., Figure 3.28). Every candidate x sends its message containing the value and the stage number and receives a message; it becomes defeated if the received value is smaller than the one it sent. If it survives, x starts stage i + 1: It sends a message with the received value and the new stage number (see Figure 3.28(b)); this message will reach the next candidate. Concentrate on the message (11, 3) in Figure 3.28(b) sent by x. Once (11, 3) reaches its destination y, as 11 < 22 and we are in an odd (i.e., min) stage, a new message (11, 4) will be originated. Observe that the fact that (11, 4) must be originated can be discovered before the message reaches y (see Figure 3.32(c)). In fact, on its way from x to y, message (11, 3) will reach the defeated node z that originated (20, 2) in the previous stage; once this happens, z knows that 11 will survive this stage (Exercise 3.10.40). What z will do is become candidate again and immediately send (11, 4).

FIGURE 3.32: Protocol MinMax+. (a) Early promotion in odd stages. (b) The message (11, 3) from x, on its way to y, reaches the defeated node z that originated (20, 2). (c) Node z becomes candidate and immediately originates envelope (11, 4).

(IV) When, in an even stage, a candidate becomes defeated, it will remember the stage number and the value it sent. If, in the next stage, it receives a message with a smaller value, it will become candidate again and start the next stage with that value.

In our example, this means that the message (11, 3) from x will stop at z and never reach y; thus, we will save d(z, y) messages. Notice that in this stage every message with a smaller value will be stopped earlier. We have, however, transformed a defeated entity into a candidate. This operation could be dangerous for the complexity of the

protocol as the number of candidates appears to be increasing (and not decreasing). This is easily taken care of: The candidates, like y, waiting for a message of an odd stage that will not arrive will become defeated. How will y know that it is defeated? The answer again is simple. The candidate that starts the next stage (e.g., z in our example) sends a message; when this message reaches an entity still waiting for a message from the previous stage (e.g., y), that entity will understand, become defeated, and forward the message. In other words,

(V) When, in an odd stage, a candidate receives a message for the next stage, it becomes defeated and forwards the message.

The modifications to MinMax described by (I)–(V) generate a new protocol that we shall call MinMax+ (Exercises 3.10.41 and 3.10.42).

Messages

Let us estimate the cost of protocol MinMax+. First of all, observe that in protocol MinMax, in each stage a message (v, i) would always reach the next candidate in that stage. This is not necessarily so in MinMax+. In fact, in an even stage i no message will travel more than dis(i), and in an odd stage a message can be “promoted” by a defeated node on the way. We must concentrate on the savings in each type of stage. Consider a message (v, i); denote by $h_i(v)$ the candidate that originates it, and, if the message is discarded in this stage, denote by $g_i(v)$ the node that discards it.

For the even stages, we must first of all choose the maximum distance dis(i) a message will travel. We will use

$$dis(i) = F_{i+2},$$

where $F_k$ denotes the kth Fibonacci number. With this choice of distance, we have a very interesting property.

Property 3.3.1 Let i be even. If message (v, i) is discarded in this stage, then $d(h_i(v), g_i(v)) \ge F_i$. For any message (v, i + 1), $d(h_i(v), h_{i+1}(v)) \ge F_{i+1}$.

This property allows us to determine the number of stages $\sigma_{MinMax+}$: In an even stage i, the distance traveled by any message is at least $F_i$; however, none of these messages travels beyond the next candidate in the ring. Hence, the distance between two successive candidates in an even stage i is at least $F_i$; this means that the number $n_i$ of candidates is at most $n_i \le n / F_i$. Hence, the number of stages will be at most $F_n^{-1} + O(1)$, where $F_n^{-1}$ is the smallest integer j such that $F_j \ge n$. Thus the algorithm will use at most

$$\sigma_{MinMax+} \le 1.44 \log n + O(1)$$

stages. This is the same as protocol MinMax.
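The relation between the smallest index j with $F_j \ge n$ and the 1.44 log n stage bound can be checked numerically. The sketch below is only illustrative: the indexing $F_1 = F_2 = 1$ and the names fib_inverse and log_phi are assumptions of this example, not notation taken from the text.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio; 1 / log2(PHI) is about 1.44


def fib_inverse(n):
    """Smallest index j such that F_j >= n, assuming F_1 = F_2 = 1."""
    j, a, b = 1, 1, 1  # a = F_j, b = F_{j+1}
    while a < n:
        a, b = b, a + b
        j += 1
    return j


def log_phi(n):
    """log base PHI of n, that is, roughly 1.44 * log2(n)."""
    return math.log2(n) / math.log2(PHI)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10**6):
        print(n, fib_inverse(n), round(log_phi(n), 2))
```

Because the Fibonacci numbers grow like powers of the golden ratio, the index returned by fib_inverse exceeds log_phi(n) by at most an additive constant, which is where the O(1) term in the stage bound comes from.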


The property also allows us to measure the number of messages we save in the odd stages. In our example of Figure 3.32(b), message (11, 3) from x will stop at z and never reach y; thus, we will save d(z, y) transmissions. In general, a message with value v that reaches an even stage i + 1 (e.g., (11, 4)) saves at least $F_i$ transmissions in stage i (Exercise 3.10.44). The total number of transmissions in an odd stage i is, thus, at most $n - n_{i+1} F_i$, where $n_{i+1}$ denotes the number of candidates in stage i + 1.

The total number of messages in an even stage is at most n. As in an even stage i + 1 each message travels at most $dis(i+1) = F_{i+3}$, the total number of message transmissions in an even stage i + 1 will be at most $n_{i+1} F_{i+3}$. Thus, the total number of messages in an even stage i + 1 is at most $\min\{n, n_{i+1} F_{i+3}\}$.

If we now consider an odd stage i followed by an even stage i + 1, the total number of message transmissions in the two stages will be at most

$$\min\{\, n + n_{i+1}(F_{i+3} - F_i),\; 2n - n_{i+1} F_i \,\} \;\le\; 2n - n\,\frac{F_i}{F_{i+3}} \;<\; n\,(4 - \sqrt{5} + \varphi^{-2i}),$$

where $\varphi = \frac{1+\sqrt{5}}{2}$. Hence,

$$M[MinMax+] \le \frac{4 - \sqrt{5}}{2}\, n \log_{\varphi}(n) + O(n) < 1.271\, n \log n + O(n). \qquad (3.27)$$
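The constant 1.271 in Equation 3.27 can be reproduced with a few lines of arithmetic. The following sketch assumes the indexing $F_1 = F_2 = 1$; the names fib, pair_cost_bound, LIMIT_PER_PAIR, and CONSTANT are introduced here only for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2


def fib(k):
    """k-th Fibonacci number, assuming F_1 = F_2 = 1."""
    a, b = 1, 1
    for _ in range(k - 1):
        a, b = b, a + b
    return a


def pair_cost_bound(i):
    """Per-node bound 2 - F_i / F_{i+3} for an odd stage i plus even stage i + 1."""
    return 2 - fib(i) / fib(i + 3)


# F_i / F_{i+3} tends to PHI**-3 = sqrt(5) - 2, so a pair of stages costs
# about (4 - sqrt(5)) * n transmissions in the limit.
LIMIT_PER_PAIR = 4 - math.sqrt(5)

# There are about log_PHI(n) stages, that is, log_PHI(n) / 2 pairs of stages;
# converting log_PHI to log2 gives the multiplicative constant of n log n.
CONSTANT = LIMIT_PER_PAIR / 2 / math.log2(PHI)

if __name__ == "__main__":
    print(round(CONSTANT, 4))
    for i in (1, 3, 5, 7):
        print(i, pair_cost_bound(i), LIMIT_PER_PAIR + PHI ** (-2 * i))
```

The loop prints, for a few odd stages, the per-node bound next to the right-hand side $4 - \sqrt{5} + \varphi^{-2i}$ of the inequality, so the two can be compared directly.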

Thus, protocol MinMax+ is the most efficient protocol we have seen so far, with respect to the worst case.

3.3.8 Limits to Improvements

Throughout the previous sections, we have reduced the message costs further and further using new tools or combining existing ones. A natural question is how far we can go. Considering that the improvements have only been in the multiplicative constant of the n log n factor, the next question becomes: Is there a tool or a technique that would allow us to reduce the message costs for election significantly, for example, from O(n log n) to O(n)? These types of questions are all part of a larger and deeper one: What is the message complexity of election in a ring? To answer this question, we need to establish a lower bound, a limit that no election protocol can improve upon, regardless of the amount and cleverness of the design effort. In this section we will see different bounds, some for unidirectional rings and others for bidirectional ones, depending on the amount of a priori knowledge the


entities have about the ring. As we will see, in all cases, the lower bounds are of the form Ω(n log n). Thus, any further improvement can only be in the multiplicative constant.

Unidirectional Rings

We want to know the number of messages that any election algorithm for unidirectional rings must transmit in the worst case. A subtler question is to determine the number of messages that any solution algorithm must transmit on the average; clearly, a lower bound on the average case is also a lower bound on the worst case. We will establish a lower bound under the standard assumptions of Connectivity and Total Reliability, plus Initial Distinct Values (required for election), and obviously Ring. We will actually establish the bound assuming that there is Message Ordering; this implies that the same bound holds also in systems without Message Ordering. The lower bound will be established for minimum-finding protocols; because of the Initial Distinct Values restriction, every minimum-finding protocol is also an election protocol. Also, we know that with n additional messages, every election protocol becomes a minimum-finding protocol.

When a minimum-finding algorithm is executed in a ring of entities with distinct values, the total number of transmitted messages depends on two factors: communication delays and the assignment of initial values. Consider the unidirectional ring $R = (x_0, x_1, \ldots, x_{n-1})$; let $s_i = id(x_i)$ be the unique value assigned to $x_i$. The sequence $s = \langle s_0, s_1, \ldots, s_{n-1} \rangle$, thus, describes the assignment of ids to the entities. Denote by S the set of all such assignments. Given a ring R of size n and an assignment s ∈ S of n ids, we will say that R is labeled by s, and denote it by R(s).

Let A be a minimum-finding protocol under the restrictions stated above. Consider the executions of A started simultaneously by all entities and their cost.
The average and the worst-case costs of these executions are possibly better but surely not worse than the average and the worst-case costs, respectively, over all possible executions; thus, if we find them, they will give us a lower bound.

Call global state of an entity x at time t the content of all its local registers and variables at time t. As we know, the entities are event driven. This means that, for a fixed set of rules A, their next global state will depend solely on the current one and on what event has occurred. In our case, once the execution of A is started, the only external events are the arrival of messages. During an action, an entity might send one or more messages to its only out-neighbor; if it is more than one, we can “bundle” them together as they are all sent within the same action (i.e., before any new message is received). Thus, we assume that in A, only one message is sent in the execution of an action by an entity.

Associate to each message all the “history” of that message. That is, with each message M, we associate a sequence of values, called trace, as follows: (1) If the sender has id $s_i$ and has not previously received any message, the trace will be just $\langle s_i \rangle$. (2) If the sender has id $s_i$ and its last message previously received has trace $\langle l_1, \ldots, l_{k-1} \rangle$, k > 1, the trace will be $\langle l_1, \ldots, l_{k-1}, s_i \rangle$, which has length k.

[Footnote 1: The converse is not true.]

Thus, a message M with trace $\langle s_i, s_{i+1}, \ldots, s_{i+k} \rangle$ indicates that a message was originally sent by entity $x_i$; as a reaction, the neighbor $x_{i+1}$ sent a message; as a reaction, the neighbor $x_{i+2}$ sent a message; . . . ; as a reaction, $x_{i+k}$ sent the current message M.

IMPORTANT. Note that because of our two assumptions (simultaneous start by all entities and only one message per action), messages are uniquely described by their associated trace.

We will denote by ab the concatenation of two sequences a and b. If d = abc, then a, b, and c are called subsequences of d; in particular, each of a, ab, and abc will be called a prefix of d; each of c, bc, and abc will be called a suffix of d. Given a sequence a, we will denote by len(a) the length of a and by C(a) the set of cyclic permutations of a; clearly, |C(a)| = len(a).

Example If d = ⟨2, 15, 9, 27⟩, then len(d) = 4; the subsequences ⟨2⟩, ⟨2, 15⟩, ⟨2, 15, 9⟩, and ⟨2, 15, 9, 27⟩ are prefixes; the subsequences ⟨27⟩, ⟨9, 27⟩, ⟨15, 9, 27⟩, and ⟨2, 15, 9, 27⟩ are suffixes; and C(d) = {⟨2, 15, 9, 27⟩, ⟨15, 9, 27, 2⟩, ⟨9, 27, 2, 15⟩, ⟨27, 2, 15, 9⟩}.

The key point to understand is the following: If in two different rings, for example, in R(a) and in R(b), an entity executing A happens to have the same global state, and it receives the same message, then it will perform the same action in both cases, and the next global state will be the same in both executions. Recall Property 1.6.1. Let us use this point.

Lemma 3.3.1 Let a and b both contain c as a subsequence. If a message with trace c is sent in an execution of A on R(a), then a message with trace c is sent in an execution of A on R(b).

Proof. Assume that a message with trace $c = \langle s_i, \ldots, s_{i+k} \rangle$ is sent when executing A on R(a). This means that when entity $x_i$ started the trace, it had not received any other message, and so, the transmission of this message was part of its initial “spontaneous” action; as the nature of this action depends only on A, $x_i$ will send the message both in R(a) and in R(b). This message was the first and only message $x_{i+1}$ received from $x_i$ both in R(a) and in R(b); in other words, its global state until it received the message with trace starting with $s_i$ was the same in both rings; hence, it will send the same message with trace $\langle s_i, s_{i+1} \rangle$ to $x_{i+2}$ in both situations. In general, between the start of the algorithm and the arrival of a message with trace $\langle s_i, \ldots, s_{j-1} \rangle$, entity $x_j$ with id $s_j$, $i < j \le i + k$, is in the same global state and sends and receives the same messages in both R(a) and R(b); thus, it will send a message with trace $\langle s_i, \ldots, s_{j-1}, s_j \rangle$ regardless of whether the input sequence is a or b. Thus, if an execution of A in R(a) has a message with trace c, then there is an execution of A in R(b) that has a message with trace c. ∎
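The sequence operations used in this argument (prefixes, suffixes, cyclic permutations, and contiguous subsequences) are easy to make concrete. The sketch below reproduces the example d = ⟨2, 15, 9, 27⟩ with Python tuples; the function names are illustrative and not part of the text's notation.

```python
def prefixes(d):
    """All nonempty prefixes of the sequence d, shortest first."""
    return [d[:k] for k in range(1, len(d) + 1)]


def suffixes(d):
    """All nonempty suffixes of the sequence d, longest first."""
    return [d[k:] for k in range(len(d))]


def cyclic_permutations(d):
    """The set C(d) of cyclic permutations (rotations) of d, as a list."""
    return [d[k:] + d[:k] for k in range(len(d))]


def is_subsequence(c, d):
    """True if c occurs in d as a block of consecutive elements."""
    return any(d[k:k + len(c)] == c for k in range(len(d) - len(c) + 1))


if __name__ == "__main__":
    d = (2, 15, 9, 27)
    print(prefixes(d))
    print(suffixes(d))
    print(cyclic_permutations(d))
    print(is_subsequence((15, 9), d))
```

Since the ids are distinct, the rotations of d are pairwise different, which is why |C(d)| = len(d).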


In other words, if R(a) and R(b) have a common segment c (i.e., a consecutive group of len(c) entities in R(a) has the same ids as a consecutive group of entities in R(b)), the entity at the end of the segment cannot distinguish between the two rings when it sends the message with trace c.

As different assignments of values to rings may lead to different results (i.e., different minimum values), the protocol A must allow the entities to distinguish between those assignments. As we will see, this will be the reason Ω(n log n) messages are needed. To prove it, we will consider a set of assignments on rings, which makes distinguishing among them “expensive” for the algorithm.

A set E ⊆ S of assignments of values is called exhaustive if it has the following two properties:
1. Prefix Property: For every sequence belonging to E, its nonempty prefixes also belong to E, that is, if ab ∈ E and len(a) ≥ 1, then a ∈ E.
2. Cyclic Permutation Property: Whether or not an assignment of values s belongs to E, at least one of its cyclic permutations belongs to E, that is, if s ∈ S, then C(s) ∩ E ≠ ∅.

Lemma 3.3.2 A has an exhaustive set E(A) ⊆ S.

Proof. Define E(A) to be the set of all the arrangements s ∈ S such that a message with trace s is sent in the execution of A in R(s). To prove that this set is exhaustive, we need to show that the cyclic permutation property and the prefix property hold.

To show that the prefix property is satisfied, choose an arbitrary s = ab ∈ E(A) with len(a) ≥ 1; by definition of E(A), there will be a message with trace ab when executing A in R(ab); this means that in R(ab) there will also be a message with trace a. Consider now the (smaller) ring R(a); as a is a subsequence of both ab and (obviously) a, and there was a message with that trace in R(ab), by Lemma 3.3.1 there will be a message with trace a also in R(a); but this means that a ∈ E(A). In other words, the prefix property holds.

To show that the cyclic permutation property is satisfied, choose an arbitrary $s = \langle s_1, \ldots, s_k \rangle$ ∈ S and consider R(s). At least one entity must receive a message with a trace t of length k, otherwise the minimum value could not have been determined; then t is a cyclic permutation of s. Furthermore, as R(t) is the same ring as R(s), t is a trace in R(t); hence t ∈ E(A). Summarizing, t ∈ E(A) ∩ C(s). In other words, the cyclic permutation property holds. ∎

Now we are going to measure how expensive it is for the algorithm A to distinguish between the elements of E(A). Let m(s, E) be the number of sequences in E ⊆ S that are prefixes of some cyclic permutation of s ∈ S, and let $m_k(s, E)$ denote the number of those that are of length k ≥ 1.

Lemma 3.3.3 The execution of A in R(s) costs at least m(s, E(A)) messages.


Proof. Let t ∈ E(A) be the prefix of some r ∈ C(s). By definition of E(A), a message with trace t is sent in R(t) and, because of Lemma 3.3.1, a message with trace t is sent also in R(r); as r ∈ C(s), a message with trace t is sent also in R(s). That is, for each prefix t ∈ E(A) of a cyclic permutation of s, there will be a message sent with trace t. The number of such prefixes t is by definition m(s, E(A)). ∎

Let $I = \{s_1, s_2, \ldots, s_n\}$ be the set of ids, and Perm(I) be the set of permutations of I. Assuming that all n! permutations in Perm(I) are equally likely, the average number $ave_A(I)$ of messages sent by A in the rings labeled by I will be the average message cost of A among the rings R(s), where s ∈ Perm(I). By Lemma 3.3.3, this means the following:

$$ave_A(I) \ge \frac{1}{n!} \sum_{s \in Perm(I)} m(s, E(A)).$$

By definition of $m_k(s, E(A))$, we have

$$ave_A(I) \ge \frac{1}{n!} \sum_{s \in Perm(I)} \sum_{k=1}^{n} m_k(s, E(A)) = \frac{1}{n!} \sum_{k=1}^{n} \sum_{s \in Perm(I)} m_k(s, E(A)).$$

We need to determine what $\sum_{s \in Perm(I)} m_k(s, E(A))$ is. Fix k and s ∈ Perm(I). Each cyclic permutation of s has only one prefix of length k. In total, there are n prefixes of length k among all the cyclic permutations of s ∈ Perm(I). As there are n! elements in Perm(I), there are n!·n instances of such prefixes for a fixed k. These n!·n prefixes can be partitioned in groups $G^k_j$ of size k, by putting together all the cyclic permutations of the same sequence; there will be $q = n!\,n/k$ such groups. As E(A) is exhaustive, by the cyclic permutation property, the set E(A) intersects each group, that is, $|E(A) \cap G^k_j| \ge 1$. Therefore,

$$\sum_{s \in Perm(I)} m_k(s, E(A)) \ge \sum_{j=1}^{q} |E(A) \cap G^k_j| \ge \frac{n!\,n}{k}.$$

Thus,

$$ave_A(I) \ge \frac{1}{n!} \sum_{k=1}^{n} \frac{n!\,n}{k} = n \sum_{k=1}^{n} \frac{1}{k} = n H_n,$$

where $H_n$ is the nth harmonic number. This lower bound on the average case is also a lower bound on the number $worst_A(I)$ of messages sent by A in the worst case in the rings labeled by I:

$$worst_A(I) \ge ave_A(I) \ge n H_n \approx 0.69\, n \log n + O(n). \qquad (3.28)$$
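The quantity $n H_n$ can be made concrete with a small computation. The sketch below evaluates $H_n$ exactly and, as a sanity check, brute-forces the average cost of a deliberately simplified model of protocol AsFar (Theorem 3.3.1) over all labelings of a small ring. The model is an assumption of this example: all entities start, each id travels until it reaches an entity with a smaller id, the minimum travels around the whole ring, and notification messages are not counted.

```python
from fractions import Fraction
from itertools import permutations
import math


def harmonic(n):
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def asfar_messages(ids):
    """Transmissions in the simplified AsFar model on the ring labeled by ids."""
    n = len(ids)
    total = 0
    for i in range(n):
        d = 1  # hops traveled by the message carrying ids[i]
        while d < n and ids[(i + d) % n] > ids[i]:
            d += 1  # forwarded by an entity with a larger id
        total += d  # stopped by a smaller id, or returned to its originator
    return total


def average_cost(n):
    """Exact average of asfar_messages over all n! labelings."""
    perms = list(permutations(range(n)))
    return Fraction(sum(asfar_messages(p) for p in perms), len(perms))


if __name__ == "__main__":
    n = 5
    print(average_cost(n), n * harmonic(n))
    # H_n grows like ln n = ln(2) * log2(n), and ln(2) is about 0.69.
    print(float(harmonic(1024)), math.log(2) * 10)
```

In this model the average over all permutations coincides with $n H_n$, which matches the claim that the lower bound is achieved on the average.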

This result states that Ω(n log n) messages are needed in the worst case by any solution protocol (the bound is true for every A), even if there is Message Ordering. Thus, any improvement we can hope to obtain by clever design will at most reduce the constant; in any case, the constant cannot be smaller than 0.69. Also, we cannot expect to design election protocols that might have a bad worst case but cost dramatically less on the average. In fact, Ω(n log n) messages are needed on the average by any protocol. Notice that the lower bound we have established can be achieved. In fact, protocol AsFar requires on the average $n H_n$ messages (Theorem 3.3.1). In other words, protocol AsFar is optimal on the average.

If the entities know n, it might be possible to develop better protocols exploiting this knowledge. In fact, the lower bound in this case leaves a little more room, but again the improvement can only be in the constant (Exercise 3.10.45):

$$worst_A(I \mid n \text{ known}) \ge ave_A(I \mid n \text{ known}) \ge \left(\frac{1}{4} - \varepsilon\right) n \log n. \qquad (3.29)$$

So far no better protocol is known.

Bidirectional Rings

In bidirectional rings, the lower bound is slightly different in both derivation and value (Exercise 3.10.46):

$$worst_A(I) \ge ave_A(I) \ge \frac{1}{2}\, n H_n \approx 0.345\, n \log n + O(n). \qquad (3.30)$$

Actually, we can improve this bound even if the entities know n (Exercise 3.10.47):

$$worst_A(I \mid n \text{ known}) \ge ave_A(I \mid n \text{ known}) \ge \frac{1}{2}\, n \log n. \qquad (3.31)$$

That is, even with the additional knowledge of n, any improvement can only be in the constant. So far, no better protocol is known.

Practical and Theoretical Implications

The lower bounds we have discussed so far indicate that Ω(n log n) messages are needed both in the worst case and on the average, regardless of whether the ring is unidirectional or bidirectional, and whether n is known or not. The only difference between these cases will be in the constant. In the previous sections, we have seen several protocols that use O(n log n) messages in the worst case (and are thus optimal); their cost provides us with upper bounds on the complexity of leader election in a ring.

If we compare the best upper and lower bounds for unidirectional rings with those for bidirectional rings, we notice the existence of a very surprising situation: The bounds for unidirectional rings are “better” than those for bidirectional ones; the upper bound is smaller and the lower bound is bigger (see Figures 3.33 and 3.34). This fact has strange implications: As far as electing a leader in a ring is concerned, unidirectional rings seem to be better systems than bidirectional ones, which in turn implies that, in practice, half-duplex links are better than full-duplex links.


bidirectional    worst case                average                   notes
All the Way      n^2                       n^2
AsFar            n^2                       0.69 n log n + O(n)
ProbAsFar        n^2                       0.49 n log n + O(n)
Control          6.31 n log n + O(n)
Stages           2 n log n + O(n)
StagesFbk        1.89 n log n + O(n)
Alternate        1.44 n log n + O(n)                                 oriented ring
BiMinMax         1.44 n log n + O(n)
lower bound                                0.5 n log n + O(n)        n = 2^p known

FIGURE 3.33: Summary of bounds for bidirectional rings.

This is clearly counterintuitive: In terms of communication hardware, full-duplex (bidirectional) links are clearly more powerful than half-duplex links. Yet the bounds are quite clear: Election protocols for unidirectional rings are more efficient than those for bidirectional ones. A natural reaction to this strange state of affairs is to suggest the use of unidirectional protocols in bidirectional rings; after all, with Bidirectional Links we can send in both directions, “left” and “right,” so we can just decide to use only one, say “right.” Unfortunately, this argument is based on the hidden assumption that the bidirectional ring is also oriented, that is, “right” means the same to all processors. In other words, it assumes that the labeling of the port numbers, which is purely local, is actually globally consistent.

This explains why we cannot use the (more efficient) unidirectional protocols in a generic bidirectional ring. But why should we do better in unidirectional rings? The answer is interesting: In a unidirectional ring, there is orientation. Each entity has only one out-neighbor, so there is no ambiguity as to where to send a message. In other words, we have discovered an important principle of the nature of distributed computing: Global consistency is more important than hardware communication power.

unidirectional   worst case                average                   notes
All the Way      n^2                       n^2
AsFar            n^2                       0.69 n log n + O(n)
UniStages        2 n log n + O(n)
UniAlternate     1.44 n log n + O(n)
MinMax           1.44 n log n + O(n)
MinMax+          1.271 n log n + O(n)
lower bound                                0.69 n log n + O(n)
lower bound                                0.25 n log n + O(n)       n = 2^p known

FIGURE 3.34: Summary of bounds for unidirectional rings.


This principle is quite general. In the case of rings, the difference is not much, just in the multiplicative constant. As we will see in other topologies, this difference can actually be dramatic.

If the ring is both bidirectional and oriented, then we can clearly use any unidirectional protocol as well as any bidirectional one. The important question is whether in this case we can do better than that. That is, the quest is for a protocol for bidirectional oriented rings that
1. fully exploits the power of both full-duplex links and orientation;
2. cannot be used or simulated in unidirectional rings, nor in general bidirectional ones; and
3. is more efficient than any unidirectional protocol or general bidirectional one.
We have seen a protocol for oriented rings, Alternate; however, it can be simulated in unidirectional rings (protocol UniAlternate). To date, no protocol with such properties is known. It is not even known whether it can exist (Problem 3.10.7).

3.3.9 Summary and Lessons

We have examined the design of several protocols for leader election in ring networks and analyzed the effects that design decisions have had on the costs. When developing the election protocols, we have introduced some key strategies that are quite general in nature and, thus, can be used for different problems and for different networks. Among them are the idea of electoral stages and the concept of controlled distances. We have also employed ideas and tools, for example, feedback and notification, already developed for other problems.

In terms of costs, we have seen that Θ(n log n) messages will be used both in the worst case and on the average, regardless of whether the ring is unidirectional or bidirectional, oriented or unoriented, and whether n is known or not. The only difference is in the multiplicative constant. The bounds are summarized in Figures 3.33 and 3.34. As a consequence of these bounds, we have seen that orientation of the ring is, so far, more powerful than the presence of Bidirectional Links.

Both ring networks and tree networks have very sparse topologies: m = n − 1 in trees and m = n in rings. In particular, if we remove any single link from a ring, we obtain a tree. Still, electing a leader costs Θ(n log n) in rings but only Θ(n) in trees. The reason for such a drastic complexity difference has to be found not in the number of links but instead in the properties of the topological structure of the two types of networks. In a tree, there is a high level of asymmetry: We have two types of nodes, internal nodes and leaves; it is by exploiting such asymmetry that election can be performed in a linear number of messages. On the contrary, a ring is a highly symmetrical structure, where every node is indistinguishable from another. Consider that the election task is really a task of breaking symmetry: We want one entity to become different from all others. The entities already have a behavioral symmetry: They all have the same set of rules and the same initial state, and potentially they


are all initiators. Thus, the structural symmetry of the ring topology only makes the solution to the problem more difﬁcult and more expensive. This observation reﬂects a more general principle: As far as election is concerned, structural asymmetry is to the protocol designer’s advantage; on the contrary, the presence of structural symmetry is an obstacle for the protocol designer. 3.4 ELECTION IN MESH NETWORKS Mesh networks constitute a large class of architectures that includes meshes and tori; this class is popular especially for parallel systems, redundant memory systems, and interconnection networks. These networks, like trees and rings, are sparse: m = O(n). Using our experience with trees and rings, we will now approach the election problem in such networks. Unless otherwise stated, we will consider Bidirectional Links. 3.4.1 Meshes A mesh M of dimensions a × b has n = a × b nodes, xi,j , 1 ≤ i ≤ a, 1 ≤ j ≤ b. Each node xi,j is connected to xi−1,j , xi,j −1 , xi+1,j , xi,j +1 if they exist; let us stress that these names are used for descriptive purposes only and are not known to the entities. The total number of links is thus m = a(b − 1) + b(a − 1) = 2ab − a − b (see Figure 3.35). Observe that in a mesh, we have three types of nodes: corner (entities with only two neighbors), border (entities with three neighbors), and interior (with four neighbors) nodes. In particular, there are four corner nodes, 2(a + b) border nodes, and n − 2(a + b − 2) interior nodes. Unoriented Mesh The asymmetry of the mesh can be exploited to our advantage when electing a leader: As it does not matter which entity becomes leader, we can elect one of the four corner nodes. In this way, the problem of choosing a leader among (possibly) n nodes is reduced to the problem of choosing a leader among the x1,1

FIGURE 3.35: Mesh of dimension 4 × 5.


four corner nodes. Recall that any number of nodes can start (each unaware of when and where the others will start, if at all); thus, to achieve our goal, we need to design a protocol that first of all makes the corners aware of the election process (they might not be initiators at all) and then performs the election among them. The first step, making the corners aware, can be performed by doing a wake-up of all entities. When an entity wakes up (spontaneously if it is an initiator, upon receiving a wake-up message otherwise), its subsequent actions will depend on whether it is a corner, a border, or an interior node. In particular, the four corners will become awake and can start the actual election process.

Observe the following interesting property of a mesh: If we consider only the border and corner nodes and the links between them, they form a ring network. We can, thus, elect a leader among the corners by using an election protocol for rings: The corners will be the only candidates; the borders will act as relayers (defeated nodes). When one of the corner nodes is elected, it will notify all other entities of termination. Summarizing, the process will consist of:

1. wake-up, started by the initiators;
2. election (on the outer ring), among the corners;
3. notification (i.e., broadcast), started by the leader.

Let us consider these three activities individually.

(1) Wake-up is straightforward. Each of the k initiators will send a wake-up to all its neighbors; a noninitiator will receive the wake-up message from a neighbor and forward it to all its other neighbors (no more than three); hence the number of messages (Exercise 3.10.48) will be no more than 3n + k.

(2) The election on the outer ring requires a little more attention. First of all, we must choose which ring protocol we will use; clearly, the selection is among the efficient ones we have discussed at great length in the preceding sections.
Then we must ensure that the messages of the ring election protocol are correctly forwarded along the links of the outer ring. Let us use protocol Stages and consider the first stage. According to the protocol, each candidate (in our case, a corner node) sends a message containing its value in both directions in the ring; each defeated entity (in our case, a border node) will forward the message along the (outer) ring. Thus, in the mesh, each corner node will send a message to its only two neighbors. A border node y, however, has three neighbors, of which only two are in the outer ring; when y receives the message, it does not know to which of the other two ports it must forward the message. What we will do is simple; as we do not know to which port the message must be sent, we will forward it to both: One will be along the ring and proceed safely, and the other will instead reach an interior node z; when the


interior node z receives such an election message, it will reply to the border node y “I am in the interior,” so no subsequent election messages are sent to it. Actually, it is possible to avoid those replies without affecting the correctness (Exercise 3.10.50).

In Stages, the number of candidates is at least halved every time. This means that after the second stage, one of the corners will determine that it has the smallest id among the four candidates and will become leader. Each stage requires 2n′ messages, where n′ = 2(a + b − 2) is the dimension of the outer ring. An additional 2(a + b − 4) messages are unknowingly sent by the border to the interior in the first stage; there are also the 2(a + b − 4) replies from those interior nodes, which, however, can be avoided (Exercise 3.10.50). Hence, the number of messages for the election process will be at most 4(a + b − 2) + 2(a + b − 4) = 6(a + b) − 16.

IMPORTANT. Notice that in a square mesh (i.e., a = b), this means that the election process proper can be achieved in $O(\sqrt{n})$ messages.

(3) Broadcasting the notification can be performed using Flood, which will require less than 3n messages as it is started by a corner. Actually, with care, we can ensure that less than 2n messages are sent in total (Exercise 3.10.49).

Thus in total, the protocol ElectMesh we have designed will have cost 6(a + b) + 5n + k − 16. With a simple modification to the protocol, it is possible to save an additional 2(a + b − 4) messages (Exercise 3.10.51), achieving a cost of at most

$$M[\text{ElectMesh}] \le 4(a + b) + 5n + k - 32. \tag{3.32}$$
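The node counts and the unoptimized cost of ElectMesh can be checked numerically. The following is a minimal sketch, assuming a, b ≥ 2; the function names `mesh_census` and `elect_mesh_cost` are hypothetical and only add up the three bounds quoted in the text (3n + k for wake-up, 6(a + b) − 16 for the election, 2n for the notification).

```python
from itertools import product

def mesh_census(a, b):
    """Count links and node types of an a x b mesh by brute force (a, b >= 2)."""
    names = {2: "corner", 3: "border", 4: "interior"}
    kinds = {"corner": 0, "border": 0, "interior": 0}
    endpoints = 0
    for i, j in product(range(1, a + 1), range(1, b + 1)):
        # neighbors that actually exist inside the grid
        degree = sum(1 for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 1 <= i + di <= a and 1 <= j + dj <= b)
        kinds[names[degree]] += 1
        endpoints += degree
    return endpoints // 2, kinds      # every link was counted from both ends

def elect_mesh_cost(a, b, k):
    """Upper bound 6(a + b) + 5n + k - 16 assembled from its three parts."""
    n = a * b
    wake_up = 3 * n + k               # bound for the wake-up
    election = 6 * (a + b) - 16       # two stages on the outer ring plus stray messages
    notification = 2 * n              # bound for the final broadcast
    return wake_up + election + notification
```

For the 4 × 5 mesh of Figure 3.35, `mesh_census(4, 5)` can be compared against the closed formulas m = 2ab − a − b, 2(a + b − 4), and n − 2(a + b − 2).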

NOTE. The most expensive operation is to wake up the nodes.

Oriented Mesh

A mesh is called oriented if the port numbers are the traditional compass labels (north, south, east, west) assigned in a globally consistent way. This assignment of labels has many important properties, in particular, one called sense of direction, which can be exploited to obtain efficient solutions to problems such as broadcast and traversal (Problems 3.10.52 and 3.10.53). For the purposes of election, in an oriented mesh, it is trivial to agree on a unique node. For example, there is only one corner with link labels “south” and “west.” Thus, to elect a leader in an oriented mesh, we must just ensure that that unique node knows that it must become leader. In other words, the only part needed is a wake-up: Upon becoming awake, and participating in the wake-up process, an entity can immediately become leader or follower depending on whether or not it is the southwest corner.
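The uniqueness claim can be illustrated with a small check. This is only a sketch under assumed conventions (rows numbered 1..a growing northward, columns 1..b growing eastward, a, b ≥ 2); `port_labels` and `elected` are hypothetical helper names, not part of the protocol.

```python
def port_labels(a, b, i, j):
    """Compass labels of the links incident on node (i, j) of an oriented a x b mesh."""
    labels = set()
    if i < a:
        labels.add("north")
    if i > 1:
        labels.add("south")
    if j < b:
        labels.add("east")
    if j > 1:
        labels.add("west")
    return labels

def elected(a, b, signature=frozenset({"south", "west"})):
    """Nodes whose set of link labels is exactly the agreed signature."""
    return [(i, j)
            for i in range(1, a + 1)
            for j in range(1, b + 1)
            if port_labels(a, b, i, j) == signature]
```

Each entity can evaluate the same test locally from its own port labels, which is why no election messages beyond the wake-up are needed.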


Notice that in an oriented mesh, we can exploit the structure of the mesh and the orientation to perform a wake-up with fewer than 2n messages (Problem 3.10.54).

Complexity

These results mean that regardless of whether the mesh is oriented or not, a leader can be elected with O(n) messages, the difference being solely in the multiplicative constant. As no election protocol for any topology can use fewer than n messages, we have

Lemma 3.4.1  M(Elect/IR; Mesh) = Θ(n)

3.4.2 Tori

Informally, the torus is a mesh with “wrap-around” links that transform it into a regular graph: Every node has exactly four neighbors. A torus of dimensions a × b has n = ab nodes $v_{i,j}$ (0 ≤ i ≤ a − 1, 0 ≤ j ≤ b − 1); each node $v_{i,j}$ is connected to the four nodes $v_{i,j+1}$, $v_{i,j-1}$, $v_{i+1,j}$, and $v_{i-1,j}$, where all the operations on the first index are modulo a, while those on the second index are modulo b (e.g., see Figure 3.36). In the following sections, we will focus on square tori (i.e., where a = b).

FIGURE 3.36: Torus of dimension 4 × 5.

Oriented Torus

We will first develop an election protocol assuming that there is the compass labeling (i.e., the links are consistently labeled as north, south, east, and west, and the dimensions are known); we will then see how to solve the problem also when the labels are arbitrary. A torus with such a labeling is said to be oriented. In designing the election protocol, we will use the idea of electoral stages developed originally for ring networks and also use the defeated nodes in an active way. We will also employ a new idea, marking of territory.

(I) In stage i, each candidate x must “mark” the boundary of a territory $T_i$ (a $d_i \times d_i$ region of the torus), where $d_i = \alpha^i$ for some fixed constant α > 1; initially the territory is just the single candidate node. The marking is done by originating a “Marking” message (with x’s value) that will travel to distance $d_i$ (distances include the starting node) first north, then east, then south, and finally west to return to x. A very important fact is that if the territories of two candidates have some elements in common, the “Marking” message of at least one of them will encounter the marking of the other (Figure 3.37).

FIGURE 3.37: Marking the territory. If the territories of two candidates intersect, one of them will see the marking of the other.

(II) If the “Marking” message of x does not encounter any other marking of the same stage, x survives this stage, enters stage i + 1, and starts the marking of a larger territory $T_{i+1}$.

(III) If the “Marking” message arrives at a node w already marked by another candidate y in the same stage, the following will occur:

1. If y has a larger id, the “Marking” message will continue to mark the boundary, setting a boolean variable SawLarger to true.
2. If the id of y is instead smaller, then w will terminate the “Marking” message from x; it will then originate a message “SeenbyLarger(x, i)” that will travel along the boundary of y’s territory.

If candidate x receives both its “Marking” message with SawLarger = true and a “SeenbyLarger” message, x survives this stage, enters stage i + 1, and starts the marking of a larger territory $T_{i+1}$.

Summarizing, for a candidate x to survive, it is necessary that it receives its “Marking” message back. If SawLarger = false, then that suffices; if SawLarger = true, x must also receive a “SeenbyLarger” message. Note that if x receives a “SeenbyLarger(z, i)” message, then z did not finish marking its boundary; thus z does not survive this stage. In other words, if x survives, either its message found no other markings, or at least another candidate does not survive.


(IV) A relay node w might receive several “Marking” messages from different candidates in the same stage. It will only be part of the boundary of the territory of the candidate with the smallest id.

This means that if w was part of the boundary of some candidate x and now becomes part of the boundary of y, a subsequent “SeenbyLarger” message intended for x will be sent along the boundary of y. This is necessary for correctness. To keep the number of messages small, we will also limit the number of “SeenbyLarger” messages sent by a relayer.

(V) A relay node will only forward one “SeenbyLarger” message.

The algorithm continues in this way until $d_i \ge \sqrt{n}$. In this case, a candidate will receive its “Marking” message from south instead of east because of the “wrap-around” in the torus; it then sends the message directly east, and will wait for it to arrive from west.

(VI) When a wrap-around is detected (i.e., it receives its “Marking” message from south rather than from east), a candidate x sends the message directly east, and waits for it to arrive from west.

If it survives, in all subsequent stages the marking becomes simpler.

(VII) In every stage after wrap-around, a candidate x sends its “Marking” message first north and waits to receive it from south, then it sends it east, and waits for it to arrive from west.

The situation where there is only one candidate left will be for sure reached after a constant number p of stages after the wrap-around occurs, as we will see later.

(VIII) If a candidate x survives p stages after wrap-around, it will become leader and notify all other entities of termination.

Let us now discuss the correctness and cost of the algorithm, protocol MarkBoundary, that we have just described.

Correctness and Cost

For the correctness, we need to show progress, that is, at least one candidate survives each stage of the algorithm, and termination, that is, p stages after wrap-around there will be only one candidate left. Let us discuss progress first.
A candidate whose “Marking” message does not encounter any other boundary will survive this stage; so the only problem would be if, in a stage, every “Marking” message encounters another candidate’s boundary, and somehow none of them advances. We must show that this cannot happen. In fact, if every “Marking” message encounters another candidate’s boundary, the one with the largest id will encounter a smaller id; the candidate with this smaller id will go onto the next stage unless its message encounters the boundary with an even smaller id, and so on; however, the message of the candidate with the smallest id cannot encounter a larger id (because it is the smallest) and, thus, that entity would survive this stage. For termination, the number of candidates does decrease overall, but not in a simple way. However, it is possible to bound the maximum number of candidates


in each stage, and that bound strictly decreases. Let $n_i$ be the maximum number of candidates in stage i. Up until wrap-around, there are two types of survivors: (a) those entities whose message did not encounter any border and (b) those whose message encountered a border with a larger id and whose border was encountered by a message with a larger id. Let $a_i$ denote the number of survivors of the first type; clearly $a_i \le n/d_i^2$. The number of the second type will be at most $(n_i - a_i)/2$, as each defeated one can cause at most one candidate to survive. Thus,

$$n_{i+1} \le a_i + \frac{n_i - a_i}{2} = \frac{n_i + a_i}{2} \le \frac{1}{2}\left(n_i + \frac{n}{d_i^2}\right).$$

As $d_i = \alpha^i$ is increasing each stage, the upper bound $n_i$ on the number of candidates is decreasing. Solving the recurrence relation gives

$$n_{i+1} \le \frac{n}{\alpha^{2i}(2 - \alpha^2)}. \tag{3.33}$$
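The closed form can be sanity-checked by iterating the recurrence directly. The sketch below assumes $n_1 = n$ (every node may be a candidate in the first stage) and $1 < \alpha < \sqrt{2}$, so that $2 - \alpha^2 > 0$; the function name `candidate_bounds` is hypothetical.

```python
def candidate_bounds(n, alpha, stages):
    """Iterate n_{i+1} = (n_i + n / d_i^2) / 2 with d_i = alpha**i, next to bound (3.33).

    Returns two lists whose entry i - 1 refers to n_{i+1}:
    the iterated value and the closed-form bound n / (alpha^(2i) (2 - alpha^2)).
    """
    current = float(n)                 # assumed starting point n_1 = n
    iterated, closed = [], []
    for i in range(1, stages + 1):
        d_i = alpha ** i
        current = (current + n / d_i ** 2) / 2
        iterated.append(current)
        closed.append(n / (alpha ** (2 * i) * (2 - alpha ** 2)))
    return iterated, closed
```

With these assumptions the iterated values stay below the closed-form bound at every stage, as the inductive argument behind the recurrence predicts.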

Wrap-around occurs when $\alpha^i \ge \sqrt{n}$; in that stage, only one candidate can complete the marking of its boundary without encountering any markings and at most half the remaining candidates will survive. So, the number of candidates surviving this stage is at most $(2 - \alpha^2)^{-1}$. In all subsequent stages, again only one candidate can complete the marking without encountering any markings and at most half the remaining candidates will survive. Hence, after $p > \log (2 - \alpha^2)^{-1}$ additional stages for sure there will be only one candidate left. Thus, the protocol correctly terminates.

To determine the total number of messages, consider that in stage i before wrap-around, each candidate causes at most $4d_i$ “Marking” messages to mark its boundary and another $4d_i$ “SeenbyLarger” messages, for a total of $8d_i = 8\alpha^i$ messages; as the number of candidates is at most as expressed by equation 3.33, the total number of messages in the pre-wrap-around stages will be at most $O\left(n\alpha^2/((2 - \alpha^2)(\alpha - 1))\right)$. In each phase after wrap-around, there is only a constant number of candidates, each sending $O(\sqrt{n})$ messages. As the number of such phases is constant, the total number of messages sent after wrap-around is $O(\sqrt{n})$. Choosing α ≈ 1.1795 yields the desired bound

$$M[\text{MarkBorder}] = \Theta(n). \tag{3.34}$$
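The choice of α can be motivated numerically: the factor $\alpha^2/((2 - \alpha^2)(\alpha - 1))$ multiplying n in the pre-wrap-around bound is smallest near α ≈ 1.1795. The sketch below is a plain grid search over the open interval $(1, \sqrt{2})$; `cost_factor` and `best_alpha` are hypothetical names, and treating this factor as the quantity to minimize is an assumption drawn from the bound above.

```python
import math

def cost_factor(alpha):
    """Multiplicative factor of n in the pre-wrap-around message bound."""
    return alpha ** 2 / ((2 - alpha ** 2) * (alpha - 1))

def best_alpha(steps=100_000):
    """Grid search for the minimizer of cost_factor on (1, sqrt(2))."""
    lo, hi = 1.0, math.sqrt(2.0)
    # interior grid points only: the factor is unbounded at both endpoints
    candidates = (lo + (hi - lo) * t / steps for t in range(1, steps))
    return min(candidates, key=cost_factor)
```

A finer method (for example, solving for the zero of the derivative) would give more digits, but the grid search is enough to see where the minimum lies.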

The preceding analysis ignores the fact that $\alpha^i$ is not an integer: The distance to travel must be rounded up and this has to be taken into account in the analysis.


However, the effect is not large and will just affect the low-order terms of the cost (Exercise 3.10.55). The algorithm as given is not very time efficient. In fact, the ideal time can be as bad as O(n) (Exercise 3.10.56). The protocol can be, however, modified so that, without changing its message complexity, the algorithm requires no more than $O(\sqrt{n})$ time (Exercise 3.10.57). The protocol we have described is tailored for square tori. If the torus is not square but rectangular with length l and width w (l ≤ w), then the algorithm can be adapted to use Θ(n + l log l/w) messages (Exercise 3.10.58).

Unoriented Torus

The algorithm we just described solved the problem of electing a leader in an oriented torus, for example, among the buildings in Manhattan (well known for its mesh-like design), by sending a messenger along east-west streets and north-south avenues, turning at the appropriate corner. Consider now the same problem when the streets have no signs and the entities have no compass. Interestingly, the same strategy can still be used: A candidate needs to mark off a square; the orientation of the square is irrelevant. To be able to travel along a square, we just need to know how to

1. forward a message “in a straight line,” and
2. make the “appropriate turn.”

We will discuss how to achieve each, separately.

(1) Forwarding in a Straight Line. We first consider how to forward a message in the direction opposite to the one from which the message was received, without knowing the directions. Consider an entity x, with its four incident links, and let a, b, c, and d be the arbitrary port numbers associated with them (see Figure 3.38); to forward a message in a straight line, x needs to determine that a and d are opposite, and so are b and c.
This can be easily accomplished by having each entity send its identity to each of its four neighbors, which will forward it to its three other neighbors; the entity will in turn acquire the identity and relative position of each entity at distance 2. As a result,

FIGURE 3.38: Even without a compass, x can determine which links are opposite.


x will know the two pairs of opposite port numbers. In the example of Figure 3.38, x will receive the message originating from z via both port a and port b; it, thus, knows that a is not opposite to b. It also receives the message from y via ports a and c; thus x knows also that a is not opposite to c. Then, x can conclude that a is opposite to d. It will then locally relabel one pair of opposite ports as east, west, and the other north, south; it does not matter which pair is chosen first.

(2) Making the Appropriate Turn. As a result of the previous operation, each entity x knows two perpendicular directions, but the naming (north, south) and (east, west) might not be consistent with the one done by other entities. This can create problems when wanting to make a consistent turn. Consider a message, originated by x, which is traveling “south” (according to x’s view of the torus); continuing to travel “south” can be easily accomplished as each entity knows how to forward a message in a straight line. At some point, according to the protocol, the message must turn, say to “east” (always according to x’s view of the torus), and continue in that direction. To achieve the turn correctly, we add a simple piece of information, called handrail, to a message. The handrail is the id of the neighbor in the direction the message must turn and the name of the direction. In the example of Figure 3.38, if x is sending a message south that must then turn east, the handrail in the message will be the id of its eastern neighbor q plus the direction “east.” Because every entity knows the ids and the relative position of all the entities within distance 2, when y receives this message with the handrail from x, it can determine what x means by “east,” and thus in which direction the message must turn (when the algorithm prescribes it).
Summarizing, even without a compass, we can execute the protocol MarkBorder by adding the preprocessing phase and including the handrail information in the messages. The cost of the preprocessing is relatively small: Each entity receives four messages from its immediate neighbors and 4 × 3 for the entities at distance 2, for a total of 16n messages.
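The preprocessing step that pairs up opposite ports can be sketched as follows. This is a centralized simulation under stated assumptions, not the distributed protocol itself: the torus is square with side at least 5 (so that the nodes two steps away in opposite directions are distinct), node coordinates stand in for the unique ids, and the names `build_unoriented_torus` and `opposite_ports` are hypothetical.

```python
import random

DIRS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def build_unoriented_torus(side, seed=0):
    """Give every node of a side x side torus an arbitrary local port numbering 0..3."""
    rng = random.Random(seed)
    ports = {}
    for i in range(side):
        for j in range(side):
            dirs = list(DIRS)
            rng.shuffle(dirs)          # the node does not know which port is which
            # ports[node][p] = neighbor reached through local port p
            ports[(i, j)] = {p: ((i + DIRS[d][0]) % side, (j + DIRS[d][1]) % side)
                             for p, d in enumerate(dirs)}
    return ports

def opposite_ports(ports, x):
    """Pair up the ports of x using only the ids that arrive through each port."""
    heard = {}
    for p, y in ports[x].items():
        # neighbor y relays to x the ids of its three other neighbors
        heard[p] = {z for z in ports[y].values() if z != x}
    pairs = {}
    for p in heard:
        # two ports delivering a common distance-2 id are perpendicular
        not_opposite = {q for q in heard if q != p and heard[p] & heard[q]}
        (opp,) = set(heard) - not_opposite - {p}
        pairs[p] = opp
    return pairs
```

The rule used is exactly the one in the text: a port is opposite to the only other port with which it shares no relayed id.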

3.5 ELECTION IN CUBE NETWORKS

3.5.1 Oriented Hypercubes

The k-dimensional hypercube $H_k$, which we have introduced in Section 2.1.3, is a common interconnection network, consisting of $n = 2^k$ nodes, each with degree k; hence, in $H_k$ there are $m = k2^{k-1} = O(n \log n)$ edges. In an oriented hypercube $H_k$, the port numbers 1, 2, . . . , k for the k edges incident on a node x are called dimensions and are assigned according to the “construction rules” specifying $H_k$ (see Fig. 2.3).

We will solve the election problem in oriented hypercubes using the electoral stages approach that we have developed for ring networks. The metaphor we will use is that of a fencing tournament: in a stage of the tournament, each candidate, called duelist, will be assigned another duelist, and each pair will have a match; as a result


of the match, one duelist will be promoted to the next stage, the other excluded from further competition. In each stage, only half of the duelists enter the next stage; at the end, there will be only one duelist that will become the leader and notify the others. Deciding the outcome of a match is easy: The duelist with the smaller id will win; for reasons that will become evident later, we will have the defeated duelist remember the shortest path to the winning duelist. The crucial and difficult parts are how pairs of opposing duelists are formed and how a duelist finds its competitor. To understand how this can be done efficiently, we need to understand some structural properties of oriented hypercubes.

A basic property of an oriented hypercube is that if we remove from $H_k$ all the links with label greater than i (i.e., consider only the first i dimensions), we are left with $2^{k-i}$ disjoint oriented hypercubes of dimension i; denote the collection of these smaller cubes by $H_{k:i}$. For example, removing the links with label 3 and 4 from $H_4$ will result in four disjoint oriented hypercubes of dimension 2 (see Figure 3.39 (a and b)). What we will do is to ensure that

(I) at the end of stage i − 1, there will be only one duelist left in each of the oriented hypercubes of dimension i − 1 of $H_{k:i-1}$.

So, for example, at the end of stage 2, we want to have only one duelist left in each of the four hypercubes of dimension 2 (see Figure 3.39(c)). Another nice property of oriented hypercubes is that if we add to $H_{k:i-1}$ the links labeled i (and, thus, construct $H_{k:i}$), the elements of $H_{k:i-1}$ will be grouped into pairs. We can use this property to form the pairs of duelists in each stage of the tournament:

(II) A duelist x starting stage i will have as its opponent the duelist in the hypercube of dimension i − 1 connected to x by the link labeled i.
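Properties (I) and (II) can be illustrated with a centralized, synchronous sketch of the tournament that ignores message routing entirely. It assumes the common encoding of $H_k$ in which nodes are the integers 0 .. 2^k − 1 and the link labeled j flips bit j − 1, so that the i-dimensional subcube containing node x is identified by `x >> i`; the names `subcube` and `tournament` are hypothetical, and all nodes are assumed to take part.

```python
def subcube(x, i):
    """Index of the i-dimensional cube of H_{k:i} that contains node x."""
    return x >> i

def tournament(k, ids):
    """Duelists left after each stage; ids[x] is the (distinct) id of node x."""
    duelists = list(range(2 ** k))     # before stage 1 every node is a duelist
    history = []
    for stage in range(1, k + 1):
        winners = {}
        for x in duelists:
            cube = subcube(x, stage)   # x and its stage opponent share this cube
            if cube not in winners or ids[x] < ids[winners[cube]]:
                winners[cube] = x      # the smaller id wins the match
        duelists = sorted(winners.values())
        history.append(duelists)
    return history
```

For instance, with k = 3 and ids `[11, 4, 9, 15, 2, 8, 13, 6]`, each stage halves the number of duelists and the last survivor is the node holding the smallest id.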
Thus, in stage i, a duelist x will send a Match message to (and receive a Match message from) the duelist y in the hypercube (of dimension i − 1) that is on the other side of link i. The Match message from x will contain the id id(x) (as well as the path traveled so far) and will be sent across dimension i (i.e., the link with label i). The entity z on the other end of the link might, however, not be the duelist y and might not even know who (and where) y is (Figure 3.40). We need the Match message from x to reach its opponent y. We can obtain this by having z broadcast the message in its (i − 1)-dimensional hypercube (e.g., using protocol HyperFlood presented in Section 2.1.3); in this way, we are sure that y will receive the message. Obviously, this approach is an expensive one (as determined in Exercise 3.10.59). To solve this problem efficiently, we will use the following observation. If node z is not the duelist (i.e., z ≠ y), node z was defeated in a previous stage, say $i_1 < i$; it knows the (shortest) path to the duelist $z_{i_1}$, which defeated it in that stage, and can thus forward the message to it. Now, if $z_{i_1} = y$, then we are done: The message from x has arrived and the match can take place. Otherwise, in a similar way, $z_{i_1}$ was


FIGURE 3.39: (a) The four-dimensional hypercube $H_4$, (b) the collection $H_{4:2}$ of two-dimensional hypercubes obtained by removing the links with labels greater than 2, and (c) duelists (in black) at the end of stage 2.

FIGURE 3.40: Each duelist (in black) sends a Match message that must reach its opponent.


defeated in some subsequent stage $i_2$, $i_1 < i_2 < i$; it, thus, knows the (shortest) path to the duelist $z_{i_2}$, which defeated it in that stage, and can thus forward the message to it. In this way, the message from x will eventually reach y; the path information in the message is updated during its travel so that y will know the dimensions traversed by the message from x to y in chronological order. The Match message from y will reach x with similar information. The match between x and y will take place both at x and y; only one of them, say x, will enter stage i + 1, while the other, y, is defeated. From now on, if y receives a Match message, it will forward it to x; as mentioned before, we need this to be done on the shortest path. How can y (the defeated duelist) know the shortest path to x (the winner)? The Match message y received from x contained the labels of a walk to it, not necessarily the shortest path. Fortunately, it is easy to determine the shortcuts in any path using the properties of the labeling. Consider a sequence α of labels (with or without repetitions); remove from the sequence any pair of identical labels and sort the remaining ones, obtaining a compressed sequence $\bar{\alpha}$. For example, if α = 231345212, then $\bar{\alpha}$ = 245. The important property is that if we start from the same node x, the walk with labels α will lead to the same node y as the walk with labels $\bar{\alpha}$. The other important property is that $\bar{\alpha}$ actually corresponds to the shortest path between x and y. Thus, y needs only to compress the sequence contained in the Match message sent by x.

IMPORTANT. We can perform the compression while the message is traveling from x to y; in this way, the message will contain at most k labels.

Finally, we must consider the fact that owing to different transmission delays, it is likely that the computation in some parts of the hypercube is faster than in others.
Thus, it may happen that a duelist x in stage i sends a Match message for its opponent, but the entities on the other side of dimension i are still in earlier stages. So, it is possible that the message from x reaches a duelist y in an earlier stage j < i. What y should do with this message depends on future events that have nothing to do with the message: If y wins all matches in stages j, j + 1, . . . , i − 1, then y is the opponent of x in stage i, and it is the destination of the message; on the contrary, if it loses one of them, it must forward the message to the winner of that match. In a sense, the message from x has arrived “too soon”; so, what y will do is to delay the processing of this message until the “right” time, that is, until it enters stage i or it becomes defeated. Summarizing,

1. A duelist in stage i will send a Match message on the edge with label i.
2. When a defeated node receives a Match message, it will forward it to the winner of the match in which it was defeated.
3. When a duelist y in stage i receives a Match message from a duelist x in stage i, if id(x) > id(y), then y will enter stage i + 1, otherwise it will become defeated and compute the shortest path to x.


4. When a duelist y in stage j receives a Match message from a duelist x in stage i > j, y will enqueue the message and process it (as a newly arrived one) when it enters stage i or becomes defeated.

The protocol terminates when a duelist wins the kth stage. As we will see, when this happens, that duelist will be the only one left in the network. The algorithm, protocol HyperElect, is shown in Figures 3.41 and 3.42. NextDuelist denotes the (list of labels on the) path from a defeated node to the duelist that defeated it. The Match message contains (Id*, stage*, source*, dest*), where Id* is the identity of the duelist x originating the message; stage* is the stage of this match; source* is (the list of labels on) the path from the duelist x to the entity currently processing the message; and dest* is (the list of labels on) the path from the entity currently processing the message to a target entity (used to forward the message by the shortest path between a defeated entity and its winner). Given a list of labels list, the protocol uses the following functions:

– first(list) returns the first element of the list;
– list ⊕ i (respectively, list ⊖ i) updates the given path by adding (respectively, eliminating) a label i to the list and compressing it.

To store the delayed messages, we use a set Delayed that will be kept sorted by stage number; for convenience, we also use a set delay of the corresponding stage numbers. Correctness and termination of the protocol derive from the following fact (Exercise 3.10.61):

Lemma 3.5.1 Let id(x) be the smallest id in one of the hypercubes of dimension i in $H_{k:i}$. Then x is a duelist at the beginning of stage i + 1.

This means that when i = k, there will be only one duelist left at the end of that stage; it will then become leader and notify the others so as to ensure proper termination. To determine the cost of the protocol, we need to determine the number of messages sent in a stage i.
For a defeated entity z, denote by w(z) its opponent (i.e., the one that won the match). For simplicity of notation, let $w^j(z) = w(w^{j-1}(z))$ where $w^0(z) = z$. Consider an arbitrary $H \in H_{k:i-1}$; let y be the only duelist in H in stage i and let z be the entity in H that first receives the Match message for y from its opponent. Entity z must send this message to y; it forwards the message (through the shortest path) to w(z), which will forward it to $w(w(z)) = w^2(z)$, which will forward it to $w(w^2(z)) = w^3(z)$, and so on, until $w^t(z) = y$. There will be no more than i such “forward” points (i.e., t ≤ i); as we are interested in the worst case, assume this to be the case. Thus, the total cost will be the sum of all the distances between successive forward points, plus one (from x to z). Denote by d(j − 1, j) the distance between $w^{j-1}(z)$ and $w^j(z)$; clearly d(j − 1, j) ≤ j (Exercise 3.10.60); then the total number of messages required for the Match message from a duelist x in stage i to reach its


PROTOCOL HyperElect.

States: S = {ASLEEP, DUELLIST, DEFEATED, FOLLOWER, LEADER};
S_INIT = {ASLEEP}; S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ OrientedHypercube.

ASLEEP
    Spontaneously
    begin
        stage:= 1; delay:= 0; value:= id(x);
        Source:= [stage]; Dest:= [ ];
        send("Match", value, stage, Source, Dest) to 1;
        become DUELLIST;
    end

    Receiving("Match", value*, stage*, Source*, Dest*)
    begin
        stage:= 1; value:= id(x);
        Source:= [stage]; Dest:= [ ];
        send("Match", value, stage, Source, Dest) to 1;
        become DUELLIST;
        if stage* = stage then
            PROCESS MESSAGE;
        else
            DELAY MESSAGE;
        endif
    end

DUELLIST
    Receiving("Match", value*, stage*, Source*, Dest*)
    begin
        if stage* = stage then
            PROCESS MESSAGE;
        else
            DELAY MESSAGE;
        endif
    end

DEFEATED
    Receiving("Match", value*, stage*, Source*, Dest*)
    begin
        if Dest* = [ ] then Dest*:= NextDuelist; endif
        l:= first(Dest*); Dest:= Dest* ⊖ l; Source:= Source* ⊕ l;
        send("Match", value*, stage*, Source, Dest) to l;
    end

    Receiving("Notify")
    begin
        send("Notify") to {l ∈ N(x) : l > sender};
        become FOLLOWER;
    end

FIGURE 3.41: Protocol HyperElect.


Procedure PROCESS MESSAGE
begin
    if value* > value then
        if stage* = k then
            send("Notify") to N(x);
            become LEADER;
        else
            stage:= stage + 1; Source:= [stage]; Dest:= [ ];
            send("Match", value, stage, Source, Dest) to stage;
            CHECK;
        endif
    else
        NextDuelist:= Source;
        CHECK ALL;
        become DEFEATED;
    endif
end

Procedure DELAY MESSAGE
begin
    Delayed ⇐ (value*, stage*, Source*, Dest*);
    delay ⇐ stage*;
end

Procedure CHECK
begin
    if Delayed ≠ ∅ then
        next:= Min{delay};
        if next = stage then
            (value*, stage*, Source*, Dest*) ⇐ Delayed;
            delay:= delay − {next};
            PROCESS MESSAGE
        endif
    endif
end

Procedure CHECK ALL
begin
    while Delayed ≠ ∅ do
        (value*, stage*, Source*, Dest*) ⇐ Delayed;
        if Dest* = [ ] then Dest*:= NextDuelist; endif
        l:= first(Dest*); Dest:= Dest* ⊖ l; Source:= Source* ⊕ l;
        send("Match", value*, stage*, Source, Dest) to l;
    endwhile
end

FIGURE 3.42: Procedures used by Protocol HyperElect.

opponent y will be at most

$$L(i) = 1 + \sum_{j=1}^{i-1} d(j-1, j) = 1 + \sum_{j=1}^{i-1} j = 1 + \frac{i(i-1)}{2}.$$

Now we know how much it costs for a Match message to reach its destination. What we need to determine is how many such messages are generated in each stage;


in other words, we want to know the number $n_i$ of duelists in stage i (as each will generate one such message). By Lemma 3.5.1, we know that at the beginning of stage i, there is only one duelist in each of the hypercubes $H \in H_{k:i-1}$; as there are exactly $n/2^{i-1} = 2^{k-i+1}$ such cubes, $n_i = 2^{k-i+1}$. Thus, the total number of messages in stage i will be

$$n_i L(i) = 2^{k-i+1}\left(1 + \frac{i(i-1)}{2}\right)$$

and over all stages, the total will be

$$\sum_{i=1}^{k} 2^{k-i+1}\left(1 + \frac{i(i-1)}{2}\right) = 2^k\left(\sum_{i=1}^{k}\frac{1}{2^{i-1}} + \sum_{i=1}^{k}\frac{i^2}{2^i} - \sum_{i=1}^{k}\frac{i}{2^i}\right) = 6 \cdot 2^k - k^2 - 3k - 6.$$

As $2^k = n$, and adding the $(n - 1)$ messages to broadcast the termination, we have

$$M[\text{HyperElect}] \le 7n - (\log n)^2 - 3\log n - 7. \qquad (3.35)$$
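The bound (3.35) can be checked numerically. The following is a minimal Python sketch, not part of the protocol: it sums the per-stage quantity $n_i L(i)$ over the k stages, adds the n − 1 termination messages, and compares the result with the closed form. The function names are illustrative assumptions.

```python
def stage_cost(k: int, i: int) -> int:
    """Bound on the messages of stage i in a k-dimensional hypercube: n_i * L(i)."""
    duelists = 2 ** (k - i + 1)          # n_i, the number of duelists in stage i
    path_bound = 1 + i * (i - 1) // 2    # L(i); i*(i-1) is always even
    return duelists * path_bound


def hyperelect_bound(k: int) -> int:
    """Sum of the stage bounds plus the n - 1 termination messages."""
    n = 2 ** k
    return sum(stage_cost(k, i) for i in range(1, k + 1)) + (n - 1)


def closed_form(k: int) -> int:
    """Right-hand side of (3.35) with log n = k."""
    n = 2 ** k
    return 7 * n - k * k - 3 * k - 7


if __name__ == "__main__":
    for k in range(1, 11):
        print(k, hyperelect_bound(k), closed_form(k))
```

For every k tried, the two columns agree, which is what one expects if the summation is carried out exactly.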

That is, we can elect a leader in less than 7n messages! This result should be contrasted with the fact that in a ring we need $\Omega(n \log n)$ messages.

As for the time complexity, it is not difficult to verify that protocol HyperElect requires at most $O(\log^3 n)$ ideal time (Exercise 3.10.62).

Practical Considerations The O(n) message cost of protocol HyperElect is achieved by having the Match messages convey path information in addition to the usual id and stage number. In particular, the fields Source and Dest have been described as lists of labels; as we only send compressed paths, Source and Dest contain at most log n labels each. So it would appear that the protocol requires "long" messages. We will now see that in practice, each list only requires log n bits (i.e., the cost of a counter).

Examine a compressed sequence of edge labels α in $H_k$ (e.g., α = 1457 in $H_8$); as the sequence is compressed, there are no repetitions. The elements in the sequence are a subset of the integers between 1 and k; thus α can be represented as a binary string $b_1, b_2, \ldots, b_k$ where each bit $b_j = 1$ if and only if j is in α. Thus, the list α = 1457 in $H_8$ is uniquely represented as 10011010. Hence, each of Source and Dest will be just a variable of k = log n bits. This also implies that the cost in terms of bits of the protocol will be no more than

$$B[\text{HyperElect}] \le 7n(\log \mathrm{id} + 2\log n + \log\log n), \qquad (3.36)$$

where the log log n component accounts for the stage field.
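The bit-string representation of a compressed label sequence can be sketched in a few lines of Python. This is only an illustration of the encoding described above; the function names are assumptions. Note that the bit string records which labels occur, so decoding returns them in increasing order.

```python
def encode_labels(labels, k):
    """Return the bit string b1 b2 ... bk with bj = '1' iff label j occurs in labels."""
    members = set(labels)
    if len(members) != len(labels):
        raise ValueError("a compressed sequence has no repeated labels")
    if any(not 1 <= j <= k for j in members):
        raise ValueError("labels must lie between 1 and k")
    return "".join("1" if j in members else "0" for j in range(1, k + 1))


def decode_labels(bits):
    """Return, in increasing order, the labels whose bit is set."""
    return [j for j, b in enumerate(bits, start=1) if b == "1"]


if __name__ == "__main__":
    alpha = [1, 4, 5, 7]                 # the example sequence 1457 in H_8
    bits = encode_labels(alpha, 8)
    print(bits)                          # 10011010
    print(decode_labels(bits))           # [1, 4, 5, 7]
```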


3.5.2 Unoriented Hypercubes

Hypercubes with arbitrary labellings obviously do not have the properties of oriented hypercubes. It is still possible to take advantage of the highly regular structure of hypercubes to do better than in ring networks. In fact (Problem 3.10.8),

Lemma 3.5.2 $M(\text{Elect}/IR; \text{Hypercube}) \le O(n \log\log n)$

To date, it is not known whether it is possible to elect a leader in a hypercube in just O(n) messages even when it is not oriented (Problem 3.10.9).

3.6 ELECTION IN COMPLETE NETWORKS

We have seen how structural properties of the network can be effectively used to overcome the additional difficulty of operating in a fully symmetric graph. For example, in oriented hypercubes, we have been able to achieve O(n) costs, that is, comparable to those obtainable in trees. In contrast, a ring has very few links and no additional structural property capable of overcoming the disadvantages of symmetry. In particular, it is so sparse (i.e., m = n) that it has the worst diameter among regular graphs (to reach the furthermost node, a message must traverse d = n/2 links) and no shortcuts. It is thus not surprising that election requires $\Omega(n \log n)$ messages.

The ring is the sparsest network and it is an extreme in the spectrum of regular networks. At the other end of the spectrum lies the complete graph $K_n$; in $K_n$, each node is connected directly to every other node. It is thus the densest network, with $m = \frac{1}{2} n(n - 1)$ links, and the one with the smallest diameter, d = 1. Another interesting property is that $K_n$ contains every other network G as a subgraph! Clearly, physical implementation of such a topology is very expensive. Let us examine how to exploit such very powerful features to design an efficient election protocol.

3.6.1 Stages and Territory

To develop an efficient protocol for election in complete networks, we will use electoral stages as well as a new technique, territory acquisition. In territory acquisition, each candidate tries to "capture" its neighbors (i.e., all other nodes) one at a time; it does so by sending a Capture message containing its id as well as the number of nodes captured so far (the stage). If the attempt is successful, the attacked neighbor becomes captured, and the candidate enters the next stage and


continues; otherwise, the candidate becomes passive. The candidate that is successful in capturing all entities becomes the leader. Summarizing, at any time an entity is candidate, captured, or passive. A captured entity remembers the id, the stage, and the link to its "owner" (i.e., the entity that captured it). Let us now describe an electoral stage.

1. A candidate entity x sends a Capture message to a neighbor y.
2. If y is candidate, the outcome of the attack depends on the stage and the id of the two entities:
   (a) If stage(x) > stage(y), the attack is successful.
   (b) If stage(x) = stage(y), the attack is successful if id(x) < id(y); otherwise x becomes passive.
   (c) If stage(x) < stage(y), x becomes passive.
3. If y is passive, the attack is successful.
4. If y is already captured, then x has to defeat y's owner z before capturing y. Specifically, a Warning message with x's id and stage is sent by y to its owner z.
   (a) If z is a candidate in a higher stage, or in the same stage but with a smaller id than x, then the attack on y is not successful: z will notify y that, in turn, will notify x.
   (b) In all other cases (z is already passive or captured, z is a candidate in a smaller stage, or in the same stage but with a larger id than x), the attack on y is successful: z notifies x via y, and if candidate it becomes passive.
5. If the attack is successful, y is captured by x, x increments stage(x) and proceeds with its conquest.

Notice that each attempt from a candidate costs exactly two messages (one for the Capture, one for the notification) if the neighbor is also a candidate or passive; instead, if the neighbor was already captured, two additional messages will be sent (from the neighbor to its owner, and back). The strategy just outlined will indeed solve the election problem (Exercise 3.10.65).
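The outcome rules 2 to 4 above can be condensed into a small decision function. The Python sketch below is a centralized illustration under stated assumptions: the `Entity` record, its field names, and the helper names are hypothetical, and the state changes that follow an attack (capture, becoming passive, stage increment) are deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entity:
    ident: int
    status: str                       # "candidate", "passive", or "captured"
    stage: int = 1
    owner: Optional["Entity"] = None  # set only when status == "captured"


def beats(att_stage: int, att_id: int, def_stage: int, def_id: int) -> bool:
    """Higher stage wins; in the same stage the smaller id wins."""
    return att_stage > def_stage or (att_stage == def_stage and att_id < def_id)


def attack_succeeds(x: Entity, y: Entity) -> bool:
    """Outcome of a Capture sent by candidate x to neighbor y."""
    if y.status == "passive":                        # rule 3
        return True
    if y.status == "candidate":                      # rule 2
        return beats(x.stage, x.ident, y.stage, y.ident)
    z = y.owner                                      # rule 4: y is captured
    if z.status != "candidate":                      # 4(b): owner passive or captured
        return True
    return beats(x.stage, x.ident, z.stage, z.ident)  # 4(a) versus 4(b)


def messages_for_attempt(y: Entity) -> int:
    """Two messages, plus two more when the owner must be consulted."""
    return 4 if y.status == "captured" else 2
```

Usage: `attack_succeeds(Entity(3, "candidate", 2), Entity(5, "candidate", 2))` evaluates to `True`, since both are in stage 2 and the attacker has the smaller id.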
Even though each attempt costs only four (or fewer) messages, the overall cost can be prohibitive, because the number $n_i$ of candidates at level i can in general be very large (Exercise 3.10.66). To control the number $n_i$, we need to ensure that a node is captured by at most one candidate in the same level. In other words, the territories of the candidates in stage i must be mutually disjoint. Fortunately, this can be easily achieved. First of all, we provide some intelligence and decisional power to the captured nodes:

(I) If a captured node y receives a Capture message from a candidate x that is in a stage smaller than the one known to y, then y will immediately notify x that the attack is unsuccessful.


As a consequence, a captured node y will only issue a Warning for an attack at the highest level known to y. A more important change is the following:

(II) If a captured node y sends a Warning to its owner z about an attack from x, y will wait for the answer from z (i.e., locally enqueue any subsequent Capture message in the same or higher stage) before issuing another Warning.

As a consequence, if the attack from x was successful (and the stage increased), y will send to the new owner x any subsequent Warning generated by processing the enqueued Capture messages. After this change, the territories of any two candidates in the same level are guaranteed to have no nodes in common (Exercise 3.10.64). Protocol CompleteElect implementing the strategy we have just designed is shown in Figures 3.43, 3.44, and 3.45.

Let us analyze the cost of the protocol. How many candidates can there be in stage i? As each of them has a territory of size i and these territories are disjoint, there cannot be more than $n_i \le n/i$ such candidates. Each will originate an attack that will cost at most four messages; thus, in stage i, there will be at most 4n/i messages.

Let us now determine the number of stages needed for termination. Consider the following fact: if a candidate has conquered a territory of size $\frac{n}{2} + 1$, no other candidate can become leader. Hence, a candidate can become leader as soon as it reaches that stage (it will then broadcast a termination message to all nodes). Thus the total number of messages, including the n − 1 for termination notification, will be

$$n + 1 + \sum_{i=1}^{n/2} 4 n_i \le n + 1 + 4n \sum_{i=1}^{n/2} \frac{1}{i} = 4n H_{n/2} + n + 1,$$

which gives the overall cost

$$M[\text{CompleteElect}] \le 2.76\, n \log n - 1.76\, n + 1. \qquad (3.37)$$
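The first inequality in the derivation above rests only on $n_i \le n/i$. A minimal Python sketch, assuming n is even and taking the largest integer value $\lfloor n/i \rfloor$ for $n_i$, evaluates both sides exactly; the function names are illustrative assumptions.

```python
from fractions import Fraction


def per_stage_total(n: int) -> int:
    """n + 1 + sum over stages i = 1..n/2 of 4 * floor(n / i)."""
    return n + 1 + sum(4 * (n // i) for i in range(1, n // 2 + 1))


def harmonic_bound(n: int) -> Fraction:
    """4 n H_{n/2} + n + 1, with the harmonic number computed exactly."""
    h = sum(Fraction(1, i) for i in range(1, n // 2 + 1))
    return 4 * n * h + n + 1


if __name__ == "__main__":
    for n in (8, 16, 64, 256):
        print(n, per_stage_total(n), float(harmonic_bound(n)))
```

Because $\lfloor n/i \rfloor \le n/i$ term by term, the first column never exceeds the second.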

Let us now consider the time cost of the protocol. It is not difficult to see that in the worst case, the ideal time of protocol CompleteElect is linear (Exercise 3.10.67):

$$T[\text{CompleteElect}] = O(n). \qquad (3.38)$$

This must be contrasted with the O(1) time cost of the simple strategy of each entity sending its id immediately to all its neighbors, thus receiving the id of everybody else, and determining the smallest id. Obviously, the price we would pay for an O(1) time cost is $O(n^2)$ messages. Appropriately combining the two strategies, we can actually construct protocols that offer optimal O(n log n) message costs with O(n/ log n) time (Exercise 3.10.68). The time can be further reduced at the expense of more messages. In fact, it is possible to design an election protocol that, for any log n ≤ k ≤ n, uses O(nk) messages and O(n/k) time in the worst case (Exercise 3.10.69).


PROTOCOL CompleteElect.

S = {ASLEEP, CANDIDATE, PASSIVE, CAPTURED, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: IR ∪ CompleteGraph.

ASLEEP
    Spontaneously
    begin
        stage := 1; value := id(x);
        Others := N(x);
        next ← Others;
        send("Capture", stage, value) to next;
        become CANDIDATE;
    end

    Receiving("Capture", stage*, value*)
    begin
        send("Accept", stage*, value*) to sender;
        stage := 1;
        owner := sender;
        ownerstage := stage* + 1;
        become CAPTURED;
    end

CANDIDATE
    Receiving("Capture", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("Reject", stage) to sender;
        else
            send("Accept", stage*, value*) to sender;
            owner := sender;
            ownerstage := stage* + 1;
            become CAPTURED;
        endif
    end

    Receiving("Accept", stage, value)
    begin
        stage := stage + 1;
        if stage ≥ 1 + n/2 then
            send("Terminate") to N(x);
            become LEADER;
        else
            next ← Others;
            send("Capture", stage, value) to next;
        endif
    end

(CONTINUES ...)

FIGURE 3.43: Protocol CompleteElect (I).

3.6.2 Surprising Limitation

We have just developed an efficient protocol for election in complete networks. Its cost is O(n log n) messages. Observe that this is the same as we were able to do in ring networks (actually, the multiplicative constant here is worse).


CANDIDATE
    Receiving("Reject", stage*)
    begin
        become PASSIVE;
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("No", stage) to sender;
        else
            send("Yes", stage*) to sender;
            become PASSIVE;
        endif
    end

PASSIVE
    Receiving("Capture", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("Reject", stage) to sender;
        else
            send("Accept", stage*, value*) to sender;
            ownerstage := stage* + 1;
            owner := sender;
            become CAPTURED;
        endif
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < stage) or ((stage* = stage) and (value* > value)) then
            send("No", stage) to sender;
        else
            send("Yes", stage*) to sender;
        endif
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

(CONTINUES ...)

FIGURE 3.44: Protocol CompleteElect (II).

Unlike rings, in complete networks, each entity has a direct link to all other entities and there is a total of $O(n^2)$ links. By exploiting all this communication hardware, we should be able to do better than in rings, where there are only n links, and where entities can be O(n) far apart.


CAPTURED
    Receiving("Capture", stage*, value*)
    begin
        if stage* < ownerstage then
            send("Reject", ownerstage) to sender;
        else
            attack := sender;
            send("Warning", stage*, value*) to owner;
            close N(x) − {owner};
        endif
    end

    Receiving("No", stage*)
    begin
        open N(x);
        send("Reject", stage*) to attack;
    end

    Receiving("Yes", stage*)
    begin
        ownerstage := stage* + 1;
        owner := attack;
        open N(x);
        send("Accept", stage*, value*) to attack;
    end

    Receiving("Warning", stage*, value*)
    begin
        if (stage* < ownerstage) then
            send("No", ownerstage) to sender;
        else
            send("Yes", stage*) to sender;
        endif
    end

    Receiving("Terminate")
    begin
        become FOLLOWER;
    end

FIGURE 3.45: Protocol CompleteElect (III).

The most surprising result about complete networks is that in spite of having available the largest possible amount of connection links and a direct connection between any two entities, for election they do not fare better than ring networks. In fact, any election protocol will require in the worst case $\Omega(n \log n)$ messages, that is,

Property 3.6.1 $M(\text{Elect}/IR; K) = \Omega(n \log n)$

To see why this is true, observe that any election protocol also solves the wake-up problem: To become defeated or leader, an entity must have been active (i.e., awake). This simple observation has dramatic consequences. In fact, any wake-up protocol requires at least 0.5 n log n messages in the worst case (Property 2.2.5); thus, any Election protocol requires in the worst case the same number of messages.


This implies that as far as election is concerned, the very large expenses due to the physical construction of $m = (n^2 - n)/2$ links are not justifiable, as the same performance and operational costs can be achieved with only m = n links arranged in a ring.

3.6.3 Harvesting the Communication Power

The lower bound we have just seen carries a very strong and rather surprising message for network development: insofar as election is concerned, complete networks are not worth the large communication hardware costs. The fact that Election is a basic problem and that its solutions are routinely used by more complex protocols makes this message even stronger. The message is surprising because the complete graph, as we mentioned, has the most communication links of any network and the shortest possible distance between any two entities.

To overcome the limit imposed by the lower bound and, thus, to harvest the communication power of complete graphs, we need the presence of some additional tools (i.e., properties, restrictions, etc.). The question becomes: which tool is powerful enough? As each property we assume restricts the applicability of the solution, our quest for a powerful tool should be focused on the least restrictive ones. In this section, we will see how to answer this question. In the process, we will discover some intriguing relationships between port numbering and consistency and shed light on some properties of whose existence we already had an inkling in earlier sections.

We will first examine a particular labeling of the ports that will allow us to make full use of the communication power of the complete graph. The first step consists in viewing a complete graph $K_n$ as a ring $R_n$, where any two nonneighboring nodes have been connected by an additional link, called a chord. Assume that the label associated at x to link (x, y) is equal to the (clockwise) distance from x to y in the ring. Thus, each link in the ring is labeled 1 in the clockwise direction and n − 1 in the other. In general, if $l_x(x, y) = i$, then $l_y(y, x) = n - i$ (see Figure 3.46); this labeling is called chordal. Let us see how election can be performed in a complete graph with such a labeling.
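As an illustration of the chordal labeling, the following Python sketch computes the port labels from ring positions. It assumes, purely for the example, that the nodes are numbered 0 to n − 1 in clockwise order; the entities themselves know only their port labels, not these positions, and the function names are hypothetical.

```python
def chordal_label(x: int, y: int, n: int) -> int:
    """Label at node x of link (x, y): the clockwise ring distance from x to y."""
    if x == y:
        raise ValueError("a node has no link to itself")
    return (y - x) % n


def port_labels(x: int, n: int) -> dict:
    """Map each port label of x to the ring position of the neighbor it reaches."""
    return {chordal_label(x, y, n): y for y in range(n) if y != x}


if __name__ == "__main__":
    n = 5
    # Ring links carry labels 1 (clockwise) and n - 1 (counterclockwise).
    print(chordal_label(0, 1, n), chordal_label(1, 0, n))   # 1 4
    # The two labels of any link always add up to n.
    print(all(chordal_label(x, y, n) + chordal_label(y, x, n) == n
              for x in range(n) for y in range(n) if x != y))
```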
First of all, observe the following: As the links labeled 1 and n − 1 form a ring, the entities could ignore all the other links and execute on this subnet an election protocol for rings, for example, Stages. This approach will yield a solution requiring 2n log n messages in the worst case, thus already improving on CompleteElect. But we can do better than that. Consider a candidate entity x executing stage i: It will send an Election message in each direction, and each message will travel along the ring until it reaches another candidate, say y and z (see Figure 3.47). This operation will require the transmission of d(x, y) + d(x, z) messages. Similarly, x will receive the Election messages from both y and z, and decide whether it survives this stage or not, on the basis of the received ids.


FIGURE 3.46: A complete graph with chordal labeling. The links labeled 1 and 4 form a ring.

Now, in a complete graph, there exists a direct link between x and y, as well as between x and z; thus, a message from one to the other could be conveyed with only one transmission. Unfortunately, x does not know which of its n − 1 links connect it to y or to z; y and z are in a similar situation. In the example of Figure 3.47, x does not know that y is the node at distance 5 along the ring (in the clockwise direction), and thus the port connecting x to it is the one with label 5. If it did, those four defeated nodes in between them could be bypassed. Similarly, x does not know that z is at distance −3 (i.e., at distance 3 in the counterclockwise direction) and thus reachable through port n − 3. However, this information can be acquired. Assume that the Election message contains also a counter, initialized to one, which is increased by one unit by each node forwarding it. Then, a candidate receiving the Election message knows exactly which port label connects it to the originator of that message. In our example, the election message from y will have a counter equal to 5 and will arrive from link 1 (i.e., counterclockwise), while the message from z will


FIGURE 3.47: If x knew d(x, y) and d(x, z), it could reach y and z directly.


have a counter equal to 3 and will arrive from link n − 1 (i.e., clockwise). From this information, x can determine that y can be reached directly through port 5 and z is reachable through link n − 3. Similarly, y (respectively, z) will know that the direct link to x is the one labeled n − 5 (respectively, 3). This means that in the next stage, these chords can be used instead of the corresponding segments of the ring, thus saving message transmissions.

The net effect will be that in stage i + 1, the candidates will use the (smaller) ring composed only of the chords determined in the previous stage, that is, messages will be sent only on the links connecting the candidates of stage i, thus completely bypassing all entities defeated in stage i − 1 or earlier. Assume in our example that x enters stage i + 1 (and thus both y and z are defeated); it will prepare an Election message for the candidates in both directions, say u and v, and will send it directly to y and to z. As before, x does not know where u and v are (i.e., which of its links connect it to them) but, as before, it can determine it. The only difference is that the counter must be initialized to the weight of the chord: Thus, the counter of the Election message sent by x directly to y is equal to 5, and the one to z is equal to 3. Similarly, when an entity forwards the Election message through a link, it will add to the counter the weight of that link.

Summarizing, in each stage, the candidates will execute the protocol in a smaller ring. Let R(i) be the ring used in stage i; initially R(1) = $R_n$. Using the ring protocol Stages in each stage, the number of messages we will be transmitting will be exactly 2(n(1) + n(2) + . . . + n(k)), where n(i) is the size of R(i) and k ≤ log n is the number of stages; an additional n − 1 messages will be used for the leader to notify the termination. Observe that all the rings R(2), . . . , R(k) do not have links in common (Exercise 3.10.70).
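The way a candidate turns a received counter into a port label can be sketched as follows. This Python fragment simulates only the first stage (hop-by-hop forwarding on the ring links) and makes several assumptions for illustration: nodes are placed at positions 0 to n − 1 in clockwise order, the travel direction of a message is known to the receiver, and all function names are hypothetical.

```python
def travel(origin: int, direction: int, candidates: set, n: int):
    """Forward an Election message along the ring until it reaches a candidate.

    direction is +1 for clockwise and -1 for counterclockwise.
    Returns (receiver, counter).
    """
    counter = 1                          # initialized to one by the originator
    node = (origin + direction) % n
    while node not in candidates:
        counter += 1                     # each forwarding node adds one
        node = (node + direction) % n
    return node, counter


def direct_port(counter: int, direction: int, n: int) -> int:
    """Port label at the receiver that leads straight back to the originator."""
    # A clockwise-travelling message was sent by a node `counter` steps
    # counterclockwise of the receiver, which is reached through port n - counter.
    return n - counter if direction == +1 else counter


if __name__ == "__main__":
    n = 12
    candidates = {0, 5, 9}               # x = 0, y = 5, z = 9 = n - 3
    print(travel(5, -1, candidates, n))  # message from y reaches x: (0, 5)
    print(direct_port(5, -1, n))         # x reaches y through port 5
    print(travel(9, +1, candidates, n))  # message from z reaches x: (0, 3)
    print(direct_port(3, +1, n))         # x reaches z through port 9 = n - 3
```

With n = 12 the sketch reproduces the example in the text: x learns that y is behind port 5 and z behind port n − 3.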
This means that if we consider the graph G composed of all these rings, then the number of links m(G) of G is exactly m(G) = n(2) + . . . + n(k). Thus, to determine the cost of the protocol, we need to find out the value of m(G). This can be determined in many ways. In particular, it follows from a very interesting property of those rings. In fact, each R(i) is "contained" in the interior of R(i + 1): All the links of R(i) are chords of R(i + 1), and these chords do not cross. This means that the graph G formed by all these rings is planar; that is, it can be drawn in the plane without any edge crossing. A well-known fact about planar graphs is that they are sparse, that is, they contain very few links: not more than 3(n − 2) (if you did not know it, now you do). This means that our graph G has m(G) ≤ 3n − 6. As our protocol, which we shall call Kelect-Stages, uses 2(n(1) + m(G)) + n messages in the worst case, and n(1) = n, we have M[Kelect-Stages] < 8n − 12.

A less interesting but more accurate measurement of the message costs follows from observing that the nodes in each ring R(i) are precisely the entities that were candidates in stage i − 1; thus, $n(i) = n_{i-1}$. Recalling that $n_i \le \frac{1}{2} n_{i-1}$, and as $n_1 = n$,

we have

$$n(1) + n(2) + \ldots + n(k) \le n + \sum_{i=1}^{k-1} n_i < 3n,$$

which will give

$$M[\text{Kelect-Stages}] < 7n \qquad (3.39)$$

Notice that if we were to use Alternate instead of Stages as ring protocol (as we can), we would use fewer messages (Exercise 3.10.72). In any case, the conclusion is that the chordal labeling allows us to ﬁnally harvest the communication power of complete graphs and do better than in ring networks.

3.7 ELECTION IN CHORDAL RINGS

We have seen how election requires $\Omega(n \log n)$ messages in rings and can be done with just O(n) messages in complete networks provided with chordal labeling. Interestingly, oriented rings and complete networks with chordal labeling are part of the same family of networks, known as loop networks or chordal rings.

3.7.1 Chordal Rings

A chordal ring $C_n\langle d_1, d_2, \ldots, d_k\rangle$ of size n and k-chord structure $\langle d_1, d_2, \ldots, d_k\rangle$, with $d_1 = 1$, is a ring $R_n$ of n nodes $\{p_0, p_1, \ldots, p_{n-1}\}$, where each node is also directly connected to the nodes at distance $d_i$ and $n - d_i$ by additional links called chords. The link connecting two nodes is labeled by the distance that separates these two nodes on the ring, that is, following the order of the nodes on the ring: Node $p_i$ is connected to the node $p_{(i+d_j) \bmod n}$ through its link labeled $d_j$ (as shown in Figure 3.48). In particular, if the link between p and q is labeled d at p, this link is labeled n − d at q.

Note that the oriented ring is the chordal ring $C_n\langle 1\rangle$ where label 1 corresponds to "right," and n − 1 to "left." The complete graph with chordal labeling is the chordal

FIGURE 3.48: Chordal ring $C_{11}\langle 1, 3\rangle$.


ring $C_n\langle 1, 2, 3, \ldots, \lfloor n/2 \rfloor\rangle$. In fact, rings and complete graphs are two extreme topologies among chordal rings. Clearly, we can exploit the techniques we designed for complete graphs with chordal labeling to develop an efficient election protocol for the entire class of chordal ring networks. The strategy is simple:

1. Execute an efficient ring election protocol (e.g., Stages or Alternate) on the outer ring. As we did in Kelect, the message sent in a stage will carry a counter, updated using the link labels, that will be used to compute the distance between two successive candidates.
2. Use the chords to bypass defeated nodes in the next stage.

Clearly, the greater the distances that can be "bypassed" by the chords, the more messages we will be able to save. As an example, consider the chordal ring $C_n\langle 1, 2, 3, 4, \ldots, t\rangle$, where every entity is connected to its distance-t neighborhood in the ring. In this case (Exercise 3.10.76), a leader can be elected with a number of messages not more than

$$O\left(n + \frac{n}{t}\log\frac{n}{t}\right).$$

A special case of this class is the complete graph, where $t = \lfloor n/2 \rfloor$; in it we can bypass any distance in a single "hop" and, as we know, the cost becomes O(n). Interestingly, we can achieve the same O(n) result with fewer chords. In fact, consider the chordal ring $C_n\langle 1, 2, 4, 8, \ldots, 2^{\lceil \log n/2 \rceil}\rangle$; it is called double cube and k = log n. In a double cube, this strategy allows election with just O(n) messages (Exercise 3.10.78), as if we were in a complete graph and had all the links.

At this point, an interesting and important question is what is the smallest set of links that must be added to the ring to achieve a linear election algorithm. The double cube indicates that k = O(log n) suffices. Surprisingly, this can be significantly further reduced (Problem 3.10.12); furthermore, in that case (Problem 3.10.13), the O(n) cost can be obtained even if the links have arbitrary labels.

3.7.2 Lower Bounds

The class of chordal rings is quite large; it includes rings and complete graphs, and the cost of electing a leader varies greatly depending on the structure. For example, we have already seen that the complexity is $\Theta(n \log n)$ and $\Theta(n)$ in those two extreme chordal rings. We can actually establish precisely the complexity of the election problem for the entire class of chordal rings $C_n^t = C_n\langle 1, 2, 3, 4, \ldots, t\rangle$. In fact, we have (Exercise 3.10.77)

$$M(\text{Elect}/IR; C_n^t) = \Omega\left(n + \frac{n}{t}\log\frac{n}{t}\right). \qquad (3.40)$$


Notice that this class includes the two extremes. In view of the matching upper bound (Exercise 3.10.76), we have

Property 3.7.1 The message complexity of Elect in $C_n^t$ under IR is $\Theta\left(n + \frac{n}{t}\log\frac{n}{t}\right)$.
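To see how the expression in Property 3.7.1 interpolates between the two extremes, one can evaluate its growth term for different values of t. The Python sketch below does only that; the constants hidden by the Θ notation are not modeled, logarithms are taken base 2, and the function name is an illustrative assumption.

```python
import math


def chordal_cost_shape(n: int, t: int) -> float:
    """Growth term n + (n/t) * log2(n/t) for the chordal ring with chords 1..t."""
    if not 1 <= t <= n // 2:
        raise ValueError("t must lie between 1 and n/2")
    ratio = n / t
    return n + ratio * math.log2(ratio)


if __name__ == "__main__":
    n = 1024
    print(chordal_cost_shape(n, 1))       # ring: n + n log n
    print(chordal_cost_shape(n, 32))      # an intermediate chord structure
    print(chordal_cost_shape(n, n // 2))  # complete graph: linear in n
```

With t = 1 the term is n + n log n, matching the ring; with t = n/2 it is n + 2, matching the linear cost in complete graphs with chordal labeling.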

3.8 UNIVERSAL ELECTION PROTOCOLS

We have so far studied in detail the election problem in specific topologies; that is, we have developed solution protocols for restricted classes of networks, exploiting in their design all the graph properties of those networks so as to minimize the costs and increase the efficiency of the protocols. In this process, we have learned some strategies and principles, which are, however, very general (e.g., the notion of electoral stages), as well as the use of known techniques (e.g., broadcasting) as modules of our solution.

We will now focus on the main issue, the design of universal election protocols, that is, protocols that run in every network, requiring neither a priori knowledge of the topology of the network nor that of its properties (not even its size). In terms of communication software, such protocols are obviously totally portable, and thus highly desirable.

We will describe two such protocols, radically different from each other. The first, Mega-Merger, which constructs a rooted spanning tree, is highly efficient (optimal in the worst case); the protocol is, however, rather complex in terms of both specifications and analysis, and its correctness is still without a simple formal proof. The second, Yo-Yo, is a minimum-finding protocol that is exceedingly simple to specify and to prove correct; its real cost is, however, not yet known.

3.8.1 Mega-Merger

In this section, we will discuss the design of an efficient algorithm for leader election, called Mega-Merger. This protocol is topology independent (i.e., universal) and constructs a (minimum cost) rooted spanning tree of the network.

Nodes are small villages each with a distinct name, and edges are roads each with a different distance. The goal is to have all villages merge into one large megacity. A city (even a small village will be considered such) always tries to merge with the closest neighboring city. When merging, there are several important issues that must be resolved.
First and foremost is the naming of the new city. The resolution of this issue depends on how far the involved cities have progressed in the merging process, that is, on the level they have reached and on whether the merger decision is shared by both cities. The second issue to be resolved during a merging is the decision of which roads of the new city will be serviced by public transports. When a merger occurs, the roads of the new city serviced by public transports will be the roads of the two cities already serviced plus only the shortest road connecting them.


Let us clarify some of these concepts and notions, as well as the basic rules of the game.

1. A city is a rooted tree; the nodes are called districts, and the root is also known as downtown.
2. Each city has a level and a unique name; all districts eventually know the name and the level of their city.
3. Edges are roads, each with a distinct distance (from a totally ordered set). The city roads are only those serviced by public transport.
4. Initially, each node is a city with just one district, itself, and no roads. All cities are initially at the same level.

Note that as a consequence of rule (1), every district knows the direction (i.e., which of its links in the tree leads) to its downtown (Figure 3.49).

5. A city must merge with its closest neighboring city. To request the merging, a Let-us-Merge message is sent on the shortest road connecting it to that city.
6. The decision to request a merger must originate from downtown and, until the request is resolved, no other request can be issued from that city.


FIGURE 3.49: A city is a tree rooted in its downtown.


7. When a merger occurs, the roads of the new city serviced by public transports will be the roads of the two cities already serviced plus the shortest road connecting them.

Thus, to merge, the downtown of city A will first determine the shortest link, which we shall call the merge link, connecting it to a neighboring city; once this is done, a Let-us-Merge is sent through that link; the message will contain information identifying the city, its level, and the chosen merge link. Once the message reaches the other city, the actual merger can start to take place. Let us examine the components of this entire process in some detail.

We will consider city A, denote by D(A) its downtown, by level(A) its current level, and by e(A) = (a, b) the merge link connecting A to its closest neighboring city; let B be such a city. Node b will be called the entry point of the request from A to B, and node a the exit point. Once the Let-us-Merge message from a in A reaches the district b of B, three cases are possible.

If the two cities have the same level and each asks to merge with the other, we have what is called a friendly merger: The two cities merge into a new one; to avoid any conflict, the new city will have a new name and a new downtown, and its level is increased:

8. If level(A) = level(B) and the merge link chosen by A is the same as that chosen by B (i.e., e(A) = e(B)), then A and B perform a friendly merger.

If a city asks a merger with a city of higher level, it will just be absorbed, that is, it will acquire the name and the level of the other city:

9. If level(A) < level(B), A is absorbed in B.

In all other cases, the request for merging and, thus, the decision on the name are postponed:

10. If level(A) = level(B), but the merge link chosen by A is not the same as that chosen by B (i.e., e(A) ≠ e(B)), then the merge process of A with B is suspended until the level of b's city becomes larger than that of A.
11. If level(A) > level(B), the merge process of A with B is suspended: b will locally enqueue the message until the level of b's city is at least as large as the one of A. (As we will see later, this case will never occur.)

Let us see these rules in more detail.

Absorption The absorption process is the conclusion of a merger request sent by A to a city with a higher level (rule 9). As a result, city A becomes part of city


B acquiring the name, the downtown, and the level of B. This means that during absorption, (i) the logical orientation of the roads in A must be modified so that they are directed toward the new downtown (so rule (1) is satisfied); (ii) all districts of A must be notified of the name and level of the city they just joined (so rule (2) is satisfied).

All these requirements can be easily and efficiently achieved. First of all, the entry point b will notify a (the exit point of A) that the outcome of the request is absorption, and it will include in the message all the relevant information about B (name and level). Once a receives this information, it will broadcast it in A; as a result, all districts of A will join the new city and know its name and its level.

To transform A so that it is rooted in the new downtown is fortunately simple. In fact, it is sufficient to logically direct toward B the link connecting a to b and to "flip" the logical direction only of the edges in the path from the exit point a to the old downtown of A (Exercise 3.10.79), as shown in Figure 3.50. This can be done as follows: Each of the districts of A on the path from a to D(A), when it receives the broadcast from a, will locally direct toward B two links: the one from which the broadcast message is received and the one toward its old downtown.
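The re-rooting step of absorption can be illustrated with a short centralized Python sketch. It assumes, for the example only, that city A is given as a `parent` map in which each district points to its neighbor toward the old downtown and D(A) maps to `None`; the function name and this representation are hypothetical, and the distributed broadcast is not modeled.

```python
def absorb(parent: dict, a, b) -> dict:
    """Re-root city A at the entry point b of city B.

    Only the links on the path from the exit point a to the old downtown
    are flipped; every other district keeps its direction unchanged.
    """
    new_parent = dict(parent)
    prev, node = b, a
    while node is not None:
        old_up = parent[node]       # link toward the old downtown
        new_parent[node] = prev     # now directed toward B
        prev, node = node, old_up
    return new_parent


if __name__ == "__main__":
    # City A: downtown 0, with districts 1 and 4 below it, and 2 and 3 below 1.
    city_a = {0: None, 1: 0, 2: 1, 3: 1, 4: 0}
    print(absorb(city_a, a=2, b="b"))
    # {0: 1, 1: 2, 2: 'b', 3: 1, 4: 0}
```

In the example, only districts 2, 1, and 0 (the path from the exit point to the old downtown) change their direction; districts 3 and 4 are untouched, yet following the links from either of them now leads to b.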


FIGURE 3.50: Absorption. To make the districts of A be rooted in D(B), the logical direction of the links (in bold) from the downtown to the exit point of A has been “ﬂipped.”

Friendly Merger If A and B are at the same level in the merging process (i.e., level(A) = level(B)) and want to merge with each other (i.e., e(A) = e(B)), we have a friendly merger. Notice that if this is the case, a must also receive a Let-us-Merge message from b. The two cities now become one with a new downtown, a new name, and an increased level:

(i) The new downtown will be the one of a and b that has the smaller id (recall that we are working under the ID restriction).
(ii) The name of the new city will be the name of the new downtown.
(iii) The level will be increased by one unit.

Both a and b will independently compute the new name, level, and downtown. Then each will broadcast this information to its old city; as a result, all districts of A and B will join the new city and know its name and its level. Both A and B must be transformed so that they are rooted in the new downtown. As discussed in the case of absorption, it is sufficient to “flip” the logical direction only of the edges in the path from a to the old downtown of A, and of those in the path from b to the old downtown of B (Figure 3.51).

Suspension In two cases (rules (10) and (11)), the merge request of A must be suspended: b will then locally enqueue the message until the level of its city is such that it can apply rule (8) or (9). Notice that in case of suspension, nobody from city A knows that their request has been suspended; because of rule (6), no other request can be launched from A.

Choosing the Merging Edge According to rule (6), the choice of the merging edge e(A) in A is made by the downtown D(A); according to rule (5), e(A) must be the shortest road connecting A to a neighboring city. Thus, D(A) needs to find the minimum length among all the edges incident on the nodes of the rooted tree A; this will be done by implementing rule (5) as follows:

(5.1) Each district a_i of A determines the length d_i of the shortest road connecting it to another city (if none goes to another city, then d_i = ∞).
(5.2) D(A) computes the smallest of all the d_i.
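Before turning to the details of rule (5), the outcome of a Let-us-Merge request under rules (8) to (11), together with the result of a friendly merger, can be summarized in a short sketch. The City record, its field names, and the representation of a merge link as a sorted pair of endpoint ids are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class City:
    name: int                    # assumed: id of the city's downtown
    level: int
    merge_link: Tuple[int, int]  # assumed: endpoint ids in sorted order


def merge_outcome(a: City, b: City) -> str:
    """Outcome of a Let-us-Merge request sent by city a to city b."""
    if a.level < b.level:
        return "absorption"          # rule (9)
    if a.level == b.level and a.merge_link == b.merge_link:
        return "friendly merger"     # rule (8)
    return "suspended"               # rules (10) and (11)


def friendly_merge(id_a: int, id_b: int, level: int) -> Tuple[int, int]:
    """New (name, level), computed independently by both endpoints a and b."""
    return min(id_a, id_b), level + 1
```

In the actual protocol the entry point b may not yet know e(B) when the request arrives, so the comparison of merge links is resolved over time rather than in a single function call.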
Concentrate on part (5.1) and consider a district a_i; it must find among its incident edges the shortest one that leads to another city.

IMPORTANT. Obviously, a_i does not need to consider the internal roads (i.e., those that connect it to other districts of A). Unfortunately, if a link is unused, that is, no message has been sent or received through it, it is impossible for a_i to know if this road is internal or leads to a neighboring city (Figure 3.52). In other words, a_i must also try the internal unused roads.


FIGURE 3.51: Friendly merger. (a) The two cities have the same level and choose the same merge link. (b) The new downtown is the exit node (a or b) with smallest id.

Thus, a_i will determine the shortest unused edge e, prepare an Outside? message, send it on e, and wait for a reply. Consider now the district c on the other side of e, which receives this message; c knows the name(C) and the level(C) of its city (which could, however, be changing).


FIGURE 3.52: Some unused links might lead back to the city.
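The reply that district c gives to an Outside? message, worked out case by case in the text that follows, can be condensed into a small sketch. The function name and the use of None to stand for a postponed reply are assumptions of this illustration.

```python
from typing import Optional


def outside_reply(name_a: int, level_a: int,
                  name_c: int, level_c: int) -> Optional[str]:
    """Reply of district c (city C) to an Outside? message from city A.

    Returns None when c must postpone its answer (illustrative convention).
    """
    if name_a == name_c:
        return "Internal"   # same city: the road is internal
    if level_c >= level_a:
        return "External"   # C cannot be in the process of being absorbed by A
    return None             # level(C) < level(A): wait until C's level catches up
```

A real district would re-evaluate a postponed message each time its own level changes.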

If name(A) = name(C) (recall that the message contains the name of A), c will reply Internal to a_i, the road e will be marked as internal (and no longer used in the protocol) by both districts, and a_i will restart its process to find the shortest local unused edge.

If name(A) ≠ name(C), it does not necessarily mean that the road is not internal. In fact, it is possible that while c is processing this message, its city C is being absorbed by A. Observe that in this case, level(C) must be smaller than level(A) (because by rule (9) only a city with smaller level will be absorbed). This means that if name(A) ≠ name(C) but level(C) ≥ level(A), then C is not being absorbed by A, and C is for sure a different city; thus, c will reply External to a_i, which will have, thus, determined what it was looking for: d_i = length(e).

The only case left is when name(A) ≠ name(C) and level(C) < level(A), the case in which c cannot give a sure answer. So, it will not: c will postpone the reply until the level of its city becomes greater than or equal to that of A. Note that this means that the computation in A is suspended until c is ready.

NOTE. As a consequence of this last case, rule (11) will never be applied (Exercise 3.10.80).

In conclusion, determining whether a link is internal should be simple but, due to concurrency, the process is neither trivial nor obvious.

Concentrate on part (5.2). This is easy to accomplish; it is just a minimum finding in a rooted tree, for which we can use the techniques discussed in Section 2.6.7. Specifically, the entire process is composed of a broadcast of a message informing all districts in the city of the current name and level (i) of the city, followed by a convergecast.

Issues and Details We have just seen in detail the process of determining the merge link as well as the rules governing a merger. Because of the asynchronous nature of the system and its unpredictable (though finite) communication delays, it will probably be the case that different cities and districts will be at different levels at the same time. In fact, our rules take explicitly into account the interaction between neighboring cities at different levels. There are a few situations where the application of the rules will not be evident and thus require a more detailed treatment.

(I) Discovering a friendly merger We have seen that when the Let-us-Merge message from A to B arrives at b, if level(A) = level(B), the outcome will be different (friendly merger or postponement) depending on whether e(A) = e(B) or not. Thus, to decide if it is a friendly merger, b needs to know both e(A) and e(B). When the Let-us-Merge message sent from a arrives at b, it knows e(A) = (a, b).

Question. How does b know e(B)?

The answer is interesting. As we have seen, the choice of e(B) is made by the downtown D(B), which will forward the merger request message of B toward the exit point. If e(A) = e(B), b is the exit point and, thus, it will eventually receive the message to be sent to a; then (and only then) b will know the answer to the question, and that it is dealing with a friendly merger. If e(A) ≠ e(B), b is not the exit point. Note that, unless b is on the way from downtown D(B) to the exit point, b will not even know what e(B) is.

Thus, what really happens when the Let-us-Merge message from A arrives at b is the following. If b has already received a Let-us-Merge message from its downtown to be sent to a, then b knows that it is a friendly merger; also a will know when it receives the request from b. (Note for hackers: thus, in this case, no reply to the request is really necessary.)
Otherwise b does not know; thus it waits: if it is a friendly merger, sooner or later the message from its downtown will arrive and b will know; if B is requesting another city, eventually the level of b’s city will increase becoming greater than level(A) (which, as A is still waiting for the reply, cannot increase), and thus result in A being absorbed. (II) Overlapping discovery of an internal link In the merge-link calculation, when the Outside? message from a in A is sent to neighbor b in B, if name(A) = name(B) then the link (a, b) is internal and should be removed from consideration by both a and b. As b knows (it just found out receiving the message) but a possibly does not, b will send to a the reply Internal. However, if b also had sent to a an Outside? message, when a receives that message, it will ﬁnd out that (a, b) is internal, and the Internal reply would be redundant. In other words, if a and b from the same city independently send to each other an Outside? message, there is no need for either of them to reply Internal to the other. (III) Interaction between absorption and link calculation A situation that requires attention is due to the interaction between merge-link calculation and absorption. Consider the Let-us-Merge message sent by a on merge


link e(A) = (a, b) to b, and let level(A) = j < i = level(B); thus, A will have to be absorbed in B. Suppose that, when b receives the message, it is computing the merge link for its city B; as its level is i, we will call it the i-level merge link. What b will do in this case is to first proceed with the absorption of A (so as to involve it in the i-level merge-link computation), and then to continue its own computation of the merge link. More precisely, b will start the broadcast in A of the name and level of B, asking the districts there to participate in the computation of the i-level merge link for B, and then resume its computation. Suppose instead that b has already finished computing the i-level merge link for its city B; in this case, b will broadcast in A the name and level of B (so as to absorb A), but without requesting them to participate in the computation of the i-level merge link for B (it is too late).

(IV) Overlap between notification and i-level merge-link calculation As mentioned, the i-level merge-link calculation is started by a broadcast informing all districts in the city of the current name and level (i) of the city. Let us call “start-next” the function provided by these messages. Notice that broadcasts are already used following the discovery of a friendly merger or an absorption. Consider the case of a friendly merger. When the two exit points know that it is a friendly merger, the notification they broadcast will inform all districts in the merged city of the new level and new name, and tell them to start computing the next merge link. In other words, the notification is exactly the “start-next” broadcast. In the case of an absorption, as we just discussed, a “start-next” broadcast is needed only if it is not too late for the new districts to participate in the current calculation of the merge link.
If it is not too late, the notification message contains the request to participate in the next merge-link calculation; thus, it is just the propagation of the current “start-next” broadcast in this new part of the city. In other words, the “notification” broadcasts act as “start-next” broadcasts, if needed.

3.8.2 Analysis of Mega-Merger

A city only carries out one merger request at a time, but it can be asked concurrently by several cities, which in turn can be asked by several others. Some of these requests will be postponed (because the level is not right, or the entry node does not (yet) know what the answer is, etc.). Due to communication delays, some districts will be taking decisions on the basis of information (level and name of their city) that is obsolete. It is not difficult to imagine very intricate and complex scenarios that can easily occur.

How do we know that, in spite of concurrency and postponements and communication delays, everything will eventually work out? How can we be assured that some decisions will not be postponed forever, that is, there will not be deadlock? What guarantees that, in the end, the protocol terminates and a single leader will be elected? In other words, how do we know that the protocol is correct?


Because of its complexity and the variety of scenarios that can be created, there is no satisfactory complete proof of the correctness of the Mega-Merger protocol. We will discuss here a partial proof that will be sufﬁcient for our learning purposes. We will then analyze the cost of the Protocol. Finally, we will discuss the assumption of having distinct lengths associated to the links, examine some interesting connected properties, and then remove the assumption. Progress and Deadlock We will ﬁrst discuss the progress of the computation and the absence of deadlock. To do so, let us pinpoint the cases when the activity of a city C is halted by a district d of another city D. This can occur only when computing the merge edge, or when requesting a merger on the merge edge e(C); more precisely, there are three cases: (i) When computing the merge edge, a district c of C sends the Outside? message to d and D has a smaller level than C. (ii) A district c of C sends the Let-us-Merge message on the merge edge e(C) = (c, d); D and C have the same level but it is not a friendly merger. (iii) A district c of C sends the Let-us-Merge message on the merge edge e(C) = (c, d); D and C have the same level and it is a friendly merger, but d does not know yet. In cases (i) and (ii), the activities of C are suspended and will be resolved (if the protocol is correct) only in the “future,” that is, after D changes level. Case (iii) is different in that it will be resolved within the “present” (i.e., in this level); we will call this case a delay rather than a suspension. Observe that if there is no suspension, there is no problem. Property 3.8.1 If a city at level l will not be suspended, its level will eventually increase (unless it is the megacity). To see why this is true, consider the operations performed by a city C at a level l: Compute the merge edge and send a merge request on the merge edge. 
If it is not suspended, its merge request arrives at a city D with either a larger level (in which case, C is absorbed and its level becomes level(D)) or the same level and same merge edge (the case in which the two cities have a friendly merger and their level increases). So, only suspensions can create problems, but not necessarily so. Property 3.8.2 Let city C at level l be suspended by a district d in city D. If the level of the city of D becomes greater than l, C will no longer be suspended and its level will increase. This is because once the level of D becomes greater than the level of C, d can answer the Outside? message in case (i), as well as the Let-us-Merge message in case (ii). Thus, the only real problem is the presence of a city suspended by another whose level will not grow. We are now going to see that this cannot occur.


Consider the smallest level l of any city at time t, and concentrate on the set 𝒞 of cities operating at that level at that time.

Property 3.8.3 No city in 𝒞 will be suspended by a city at higher level.

This is because for a suspension to exist, the level of D cannot be greater than the level of C (see the cases above). Thus, if a city C ∈ 𝒞 is suspended, it is by some other city C′ ∈ 𝒞. If C′ is not suspended at level l, its level will increase; when that happens, C will no longer be suspended. In other words, there would be no problems as long as there are no cycles of suspensions within 𝒞, that is, as long as there is no cycle C_0, C_1, ..., C_{k−1} of cities of 𝒞 where C_i is suspended by C_{i+1} (the operations on the indices are modulo k). The crucial property is the following:

Property 3.8.4 There will be no cycles of suspensions within 𝒞.

The proof of this property is based heavily on the fact that each edge has a unique length (as we have assumed) and that the merge edge e(C) chosen by C is the shortest of all the unused links incident on C. Remember this fact and let us proceed with the proof.

By contradiction, assume that the property is false. That is, assume there is a cycle C_0, C_1, ..., C_{k−1} of cities of 𝒞 where C_i is suspended by C_{i+1} (the operations on the indices are modulo k). First of all observe that, as all these cities are at the same level, the reason they are suspended can only be that each is involved in an “unfriendly” merger, that is, case (ii). Let us examine the situation more closely: each C_i has chosen a merge edge e(C_i) connecting it to C_{i+1}; thus, C_i is suspending C_{i−1} and is suspended by C_{i+1}. Clearly, both e(C_{i−1}) and e(C_i) are incident on C_i. By definition of merging edge (recall what we said at the beginning of the proof), e(C_i) is shorter than e(C_{i−1}) (otherwise C_i would have chosen it instead); in other words, the length d_i of the road e(C_i) is smaller than the length d_{i−1} of e(C_{i−1}). This means that d_0 > d_1 > ... > d_{k−1}, but as it is a cycle of suspensions, C_{k−1} is suspended by C_0, that is, d_{k−1} > d_0. We have reached a contradiction, which implies that our assumption that the property does not hold is actually false; thus, the property is true.

As a consequence of the property, all cities in 𝒞 will eventually increase their level: first, the ones involved in a friendly merger, next those that had chosen them for a merger (and are thus absorbed by them), then those suspended by the latter, and so on. This implies that at no time will there be deadlock and there is always progress: use the properties to show that the cities with smallest level will increase their level; when this happens, again the ones with smallest level will increase it, and so on. That is,

Property 3.8.5 Protocol Mega-Merger is deadlock free and ensures progress.

Termination We have just seen that there will be no deadlock and that progress is guaranteed. This means that the cities will keep on merging and eventually the


megacity will be formed. The problem is how to detect that this has happened. Recall that no node has knowledge of the network, not even of its size (it is not part of the standard set of assumptions for election); how does an entity find out that all the nodes are now part of the same city? Clearly, it is sufficient for just one entity to determine termination (as it can then broadcast it to all the others).

Fortunately, termination detection is simple to achieve; as one might have suspected, it is the downtown of the megacity that will determine that the process is terminated. Consider the downtown D(A) of city A, and the operations it performs: it coordinates the computation of the merge link and then originates a merge request to be sent on that link. Now, the merge link is the shortest road going to another city. If A is already the megacity, there are no other cities; hence all the unused links are internal. This means that when computing the merge link, every district will explore every unused link left and discover that each one of them is internal; it will thus choose ∞ as its length (meaning that it does not have any outgoing links). As a result, the minimum-finding process will return ∞ as the smallest length. When this happens, D(A) understands that the mega-merger is completed, and can notify all others. (Notification is not really necessary: Exercise 3.10.81.) As the megacity is a rooted tree with the downtown as its root, D(A) becomes the leader; in other words,

Property 3.8.6 Protocol Mega-Merger correctly elects a leader.

Cost In spite of the complexity of protocol Mega-Merger, the analysis of its cost is not overly difficult. We will first determine how many levels there can be and then calculate the total number of messages transmitted by entities at a given level.

The Number of Levels A district acquires a larger level because its city has been either absorbed or involved in a friendly merger.
Notice that when there is absorption, only the districts in one of the two cities increase their level, and thus the max level in the system will not be increased. The max level can only increase after a friendly merger. How high can the max level be? We can find out by linking the minimum number of districts in a city to the level of the city.

Property 3.8.7 A city of level i has at least 2^i districts.

This can be proved easily by induction. It is trivially true at the beginning (i.e., i = 0). Let it be true for 0 ≤ i ≤ k − 1. A level k city can only be created by a friendly merger of two level k − 1 cities; hence, by inductive hypothesis, such a city will have at least 2 · 2^(k−1) = 2^k districts; thus the property is true also for i = k. As a consequence, since a city cannot have more than n districts,

Property 3.8.8 No city will reach a level greater than log n.


The Number of Messages per Level Consider a level i; some districts will reach this level from level i − 1 or even lower; others might never reach it (e.g., because of absorption, they move from a level lower than i directly to one larger than i). Consider only those districts that do reach level i and let us count how many messages they transmit in this level. In other words, as each message contains the level, we need to determine how many messages are sent in which the level is i.

We do know that every district (except the downtown) of a city of level i receives a broadcast message informing it that its current level is i, and to start computing the i-level merge link (this last part may not be included). Hence at most every district will receive such a message, accounting for a total of at most n messages. If the received broadcast also requests to compute the i-level merge link, a district must find its shortest outgoing link, by using Outside? messages.

IMPORTANT. For the moment, we will not consider the Outside? messages sent to internal roads (i.e., where the reply is Internal); they will be counted separately later.

In this case, the district will send at most one Outside? message that causes a reply External. The district will then participate in the convergecast, sending one message toward the downtown. Hence, all these activities (the Outside? message, its External reply, and the convergecast message) will account for a total of at most 3n messages.

Once the i-level merge links have been determined, the Let-us-Merge messages are originated and sent to and across the merge links. Regardless of the final outcome of the request, the forwarding of the i-level Let-us-Merge message from the downtown D(A) to the new city through the merge edge e(A) = (a, b) will cause at most n(A) transmissions in a city A with n(A) districts (at most n(A) − 1 internal and one on the merge edge). This means that these activities will cost in total at most Σ_{A ∈ City(i)} n(A) ≤ n

messages where City(i) is the set of the cities reaching level i. This means that excluding the number of level i messages Outside? whose reply is Internal, the total number of messages sent in level i is Property 3.8.9 Cost(i) ≤ 5n The Number of Useless Messages In the calculation so far we have excluded the Outside? messages whose reply was Internal. These messages are in a sense “useless” as they do not bring about a merger; but they are also unavoidable. Let us measure their number. On any such road there will be two messages, either the Outside? message and the Internal reply, or two Outside? messages. So, we only need to determine the number of such roads. These roads are not part of the city (i.e., not serviced by public transport). As the ﬁnal city is a tree, the total number of the publicly serviced roads is exactly n − 1. Thus, the total number of the other roads is exactly m − (n − 1). This means that the total number of useless messages will be Property 3.8.10 Useless = 2(m − n + 1)


The Total Combining Properties 3.8.8, 3.8.9, and 3.8.10, we obtain the total number of messages exchanged by protocol Mega-Merger during all its levels of execution. To these, we need to add the n − 1 messages due to the downtown of the megacity broadcasting termination (even though these could be saved: Exercise 3.10.81), for a total of

M[Mega-Merger] ≤ 2m + 5n log n + n + 1.    (3.41)
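As a quick numerical illustration, the bound of Expression (3.41) and its two main components can be evaluated for a given network size. The function names are hypothetical, and the logarithm is assumed to be base 2, in line with the level bound of Property 3.8.8.

```python
import math


def useless_messages(n: int, m: int) -> int:
    """Property 3.8.10: two messages on each of the m - (n - 1) non-tree roads."""
    return 2 * (m - n + 1)


def per_level_messages(n: int) -> int:
    """Property 3.8.9: at most 5n messages carry any given level i."""
    return 5 * n


def mega_merger_bound(n: int, m: int) -> float:
    """Upper bound of Expression (3.41), with log taken base 2 (assumption)."""
    return 2 * m + 5 * n * math.log2(n) + n + 1
```

For example, for a network with n = 8 nodes and m = 12 links, mega_merger_bound(8, 12) evaluates 2·12 + 5·8·3 + 8 + 1.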

Road Lengths and Minimum-Cost Spanning Trees In all the previous discussions we have made some nonstandard assumptions about the edges. We have in fact assumed that each link has a value, which we called length, and that those values are unique.

The existence of link values is not uncommon. In fact, when dealing with networks, there is usually a value associated with a link denoting, for example, the cost of using that link, the transmission delays incurred when sending a message through it, and so forth. In these situations, when constructing a spanning tree (e.g., to use for broadcasting), the prime concern is how to construct the one of minimum cost, that is, where the sum of the values of its links is as small as possible. For example, if the value of the link is the cost of using it, a minimum-cost spanning tree is one where broadcasting would be the cheapest (regardless of who is the originator of the broadcast). Not surprisingly, the problem of constructing a minimum-cost spanning tree is important and heavily investigated.

We have seen that protocol Mega-Merger constructs a rooted spanning tree of the network. What we are going to see now is that this tree is actually the unique minimum-cost spanning tree of the network. We are also going to see how the nonstandard assumptions that we have made about the existence of unique lengths can be easily removed.

Minimum-Cost Spanning Trees In general, a network can have several minimum-cost spanning trees. For example, if all links have the same value (or have no value), then every spanning tree is minimal. By contrast,

Property 3.8.11 If the link values are distinct, a network has a unique minimum-cost spanning tree.

Assuming that there are distinct values associated to the links, the rooted spanning tree constructed by protocol Mega-Merger is precisely this unique minimum-cost spanning tree.
To see why this is the case, we must observe a basic property of the minimum-cost spanning tree T. A fragment of T is a subtree of T.

Property 3.8.12 Let A be a fragment of T, and let e be the link of minimum value among those connecting A to other fragments; let B be the fragment connected to A by e. Then the tree composed by merging A and B through e is also a fragment of T.
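Property 3.8.12 is the same fact that underlies the classical Borůvka algorithm for minimum-cost spanning trees. The sketch below is a centralized, round-based version of that algorithm, not protocol Mega-Merger itself (there are no levels, messages, or suspensions); it only illustrates how repeatedly merging fragments through their minimum outgoing link yields the tree. The function name and the edge representation (value, u, v), with nodes numbered 0..n−1 and distinct values, are assumptions of this example.

```python
def merge_fragments(n, edges):
    """Borůvka-style construction of the minimum-cost spanning tree.

    edges: iterable of (value, u, v) with distinct values; nodes are 0..n-1.
    Returns the set of edges chosen as merge links.
    """
    label = list(range(n))              # fragment representative per node

    def find(x):
        while label[x] != x:
            label[x] = label[label[x]]  # path halving
            x = label[x]
        return x

    tree = set()
    fragments = n
    while fragments > 1:
        best = {}                       # fragment -> its minimum outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue                # internal link, ignore
            for r in (ru, rv):
                if r not in best or w < best[r][0]:
                    best[r] = (w, u, v)
        if not best:
            break                       # graph is not connected
        for w, u, v in set(best.values()):
            ru, rv = find(u), find(v)
            if ru != rv:                # merge the two fragments through (u, v)
                label[ru] = rv
                tree.add((w, u, v))
                fragments -= 1
    return tree
```

Distinctness of the values is what prevents two fragments from choosing links that would close a cycle, which mirrors the role of Property 3.8.11 in the text.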


This is exactly what the Mega-Merger protocol does: it constructs the minimum-cost spanning tree T (the megacity) by merging fragments (cities) through the appropriate edges (merge links). Initially, each node is a city and, by definition, a single node is a fragment. In general, each city A is a fragment of T; its merge link is chosen as the shortest (i.e., minimum value) link connecting A to any neighboring city (i.e., fragment); hence, by Property 3.8.12, the result of the merger is also a fragment. Notice that the correctness of the process depends crucially on Property 3.8.11, and thus on the distinctness of the link values.

Creating Unique Lengths We will now remove the assumptions that there are values associated to the links and that these values are unique.

If there are no values (the more general setting), then a unique value can be easily given to each link using the fact that the nodes have unique ids: to link e = (a, b) associate the sorted pair d(e) = ⟨Min{id(a), id(b)}, Max{id(a), id(b)}⟩ and use the lexicographic ordering to determine which edge has smaller length. So, for example, the link between nodes with ids 17 and 5 will have length ⟨5, 17⟩, which is smaller than ⟨6, 5⟩ but greater than ⟨4, 32⟩. To do this requires, however, that each node knows the id of all its neighbors. This information can be acquired in a preprocessing phase, in which every node sends its id to its neighbors (and will receive theirs from them); the cost will be two additional messages on each link. Thus, even if there are no values associated to the links, it is possible to use protocol Mega-Merger. The price we have to pay is 2m additional messages.

If there are values but they are not (known to be) unique, they can be made so, again using the fact that the nodes have unique ids. To link e = (a, b) with value v(e) associate the sorted triple d(e) = ⟨v(e), Min{id(a), id(b)}, Max{id(a), id(b)}⟩. Thus, links with the same values will now be associated to different lengths.
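Both constructions map directly onto tuples compared lexicographically. The following sketch (the function name edge_length is hypothetical) relies on the fact that Python compares tuples element by element, which is exactly the lexicographic ordering required.

```python
def edge_length(id_a, id_b, value=None):
    """Unique length of link (a, b) built from the endpoint ids.

    Without a link value: the sorted pair (min id, max id).
    With a link value v: the triple (v, min id, max id).
    """
    lo, hi = min(id_a, id_b), max(id_a, id_b)
    return (lo, hi) if value is None else (value, lo, hi)
```

Because node ids are distinct, two different links can never produce the same pair, and two links with the same value can never produce the same triple.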
So, for example, the link between nodes with ids 17 and 5 and value 7 will have length ⟨7, 5, 17⟩, which is smaller than ⟨7, 6, 5⟩ but greater than ⟨7, 4, 32⟩. Also, in this case, each node needs to know the id of all its neighbors. The same preprocessing phase will achieve the goal with only 2m additional messages.

Summary Protocol Mega-Merger is a universal protocol that constructs a (minimum-cost) spanning tree and returns it rooted in a node, thus electing a leader. If there are no initial distinct values on the links, a preprocessing phase needs to be added, in which each entity exchanges its unique id with its neighbors; then the actual execution of the protocol can start. The total cost of the protocol (with or without preprocessing phase) is O(m + n log n), which, we will see, is worst case optimal. The main drawback of Mega-Merger is its design complexity, which makes any actual implementation difficult to verify.

3.8.3 YO-YO

We will now examine another universal protocol for leader election. Unlike the previous one, it has simple specifications, and its correctness is simple to establish. This protocol, called YO-YO, is a minimum-finding algorithm and consists of two parts: a preprocessing phase and a sequence of iterations. Let us examine them in detail.


Setup In the preprocessing phase, called Setup, every entity x exchanges its id with its neighbors. As a result, it will receive the id of all its neighbors. Then, x will logically orient each incident link (x, y) in the direction of the entity (x or y) with the larger id. So, if id(x) = 5 and its neighbor y has id(y) = 7, x will orient (x, y) toward y; notice that y will also do the same. In fact, the orientation of each link will be consistent at both end nodes.

Consider now the directed graph G⃗ so obtained. There is a very simple but important property:

Property 3.8.13 G⃗ is acyclic.

To see why this is true, consider by contradiction the existence of a directed cycle x_0, x_1, ..., x_{k−1}; this means that id(x_0) < id(x_1) < ... < id(x_{k−1}) but, as it is a cycle, id(x_{k−1}) < id(x_0), which is impossible.

This means that G⃗ is a directed acyclic graph (DAG). In a DAG, there are three types of nodes:

– source is a node where all the links are out-edges; thus, a source in G⃗ is a node with an id smaller than that of all its neighbors, that is, it is a local minimum;
– sink is a node where all the links are in-edges; thus, a sink in G⃗ is a node whose id is larger than that of all its neighbors, that is, it is a local maximum;
– internal node is a node that is neither a source nor a sink.

As a result of the setup, each node will know whether it is a source, a sink, or an internal node. We will also use the terminology of “down” referring to the direction toward the sinks, and “up” referring to the direction toward the sources (see Figure 3.53).

Once this preprocessing is completed, the second part of the algorithm starts. As YO-YO is a minimum-finding protocol, only the local minima (i.e., the sources) will be the candidates (Figure 3.54).

Iteration The core of the protocol is a sequence of iterations. Each iteration acts as an electoral stage in which some of the candidates are removed from consideration.
Each iteration is composed of two parts, or phases, called YO- and -YO.

YO- This phase is started by the sources. Its purpose is to propagate to each sink the smallest among the values of the sources connected to that sink, in the sense that there is a directed path from the source to that sink (see Figure 3.54(a)).

1. A source sends its value down to all its out-neighbors.
2. An internal node waits until it receives a value from all its in-neighbors. It then computes the minimum of all received values and sends it down to its out-neighbors.
3. A sink waits until it receives a value from all its in-neighbors. It then computes the minimum of all received values and starts the second part of the iteration.
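Rules 1 to 3 can be simulated centrally by processing a node as soon as it has heard from all its in-neighbors. This is only a sketch of the message flow: the function name yo_down and the dictionaries of in- and out-neighbors are assumptions of the example, and the queue stands in for asynchronous message delivery.

```python
from collections import deque


def yo_down(in_nbrs, out_nbrs):
    """Simulate the YO- phase on a DAG.

    Returns (value, received): value[x] is the minimum known to x
    (its own id for a source); received[x][y] is what in-neighbor y sent to x.
    """
    received = {x: {} for x in in_nbrs}
    value = {}
    ready = deque(x for x in in_nbrs if not in_nbrs[x])   # rule 1: sources start
    while ready:
        x = ready.popleft()
        value[x] = x if not in_nbrs[x] else min(received[x].values())
        for y in out_nbrs[x]:                              # send the value down
            received[y][x] = value[x]
            if len(received[y]) == len(in_nbrs[y]):        # rules 2 and 3: all heard
                ready.append(y)
    return value, received
```

At the end, each sink holds the smallest id among the sources from which it can be reached.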


FIGURE 3.53: In the Setup phase, (a) the entities know their neighbors’ ids and (b) orient each incident link toward the larger id, creating a DAG.

-YO This phase is started by the sinks. Its purpose is to eliminate some candidates, transforming some sources into sinks or internal nodes. This is done by having the sinks inform their connected sources of whether or not the id they sent is the smallest seen so far (see Figure 3.54(b)).

4. A sink sends YES to all in-neighbors from which the smallest value has been received. It sends NO to all the others.
5. An internal node waits until it receives a vote from all its out-neighbors. If all votes are YES, it sends YES to all in-neighbors from which the smallest value has been received and NO to all the others. If at least one vote was NO, it sends NO to all its in-neighbors.
6. A source waits until it receives a vote from all its out-neighbors. If all votes are YES, it survives this iteration and starts the next one. If at least one vote was NO, it is no longer a candidate.

FIGURE 3.54: In the Iteration stage, only the candidates are sources. (a) In the YO- phase, the ids are filtered down to the sinks. (b) In the -YO phase, the votes percolate up to the sources.

Before the next iteration can be started, the directions of the links in the DAG must be modified so that only the sources that are still candidates (i.e., those that received only YES) will still be sources; clearly, the modification must be done without creating cycles. In other words, we must transform the DAG into a new one, whose only sources are those undefeated in this iteration. This modification is fortunately simple to achieve. We need only to “flip” the direction of each link on which a NO vote is sent (see Figure 3.55(a)). Thus, we have two meta-rules for the -YO part:

7. When a node x sends NO to an in-neighbor y, it will reverse the (logical) direction of that link (thus, y now becomes an out-neighbor of x).
8. When a node y receives NO from an out-neighbor x, it will reverse the (logical) direction of that link (thus, x now becomes an in-neighbor of y).

FIGURE 3.55: (a) In the -YO phase, we flip the logical direction of the links on which a NO is sent, (b) creating a new DAG, where only the surviving candidates will be sources.
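Rules 1 to 8 can be exercised with a centralized simulation of a single iteration. The sketch below is an illustration only, assuming the whole DAG is available in one place (the actual protocol is message driven and each entity sees only its own links); the names `orient`, `sources`, and `yoyo_iteration` and the example network are hypothetical. Pruning is deliberately left out.

```python
def orient(adj):
    """Setup: orient each link toward the larger id."""
    out_nbrs = {x: {y for y in nbrs if y > x} for x, nbrs in adj.items()}
    in_nbrs = {x: {y for y in nbrs if y < x} for x, nbrs in adj.items()}
    return out_nbrs, in_nbrs


def sources(in_nbrs):
    """Nodes without in-neighbors, i.e., the current candidates."""
    return sorted(x for x, ins in in_nbrs.items() if not ins)


def yoyo_iteration(out_nbrs, in_nbrs):
    """Run one iteration (rules 1-8, no pruning); the DAG is updated in place."""
    # YO- phase: visit nodes in topological order, pushing minima down.
    pending = {x: len(in_nbrs[x]) for x in out_nbrs}
    ready = [x for x in out_nbrs if pending[x] == 0]
    received = {x: {} for x in out_nbrs}   # received[x][y]: value y sent to x
    value, order = {}, []
    while ready:
        x = ready.pop()
        order.append(x)
        # A source sends its own id; any other node sends the minimum received.
        value[x] = min(received[x].values()) if received[x] else x
        for y in out_nbrs[x]:
            received[y][x] = value[x]
            pending[y] -= 1
            if pending[y] == 0:
                ready.append(y)

    # -YO phase: visit nodes in reverse order, pushing votes up.
    votes = {x: [] for x in out_nbrs}      # votes[x]: votes from out-neighbors
    flips = []
    for x in reversed(order):
        all_yes = all(votes[x])            # vacuously True for a sink
        for y in in_nbrs[x]:
            yes = all_yes and received[x][y] == value[x]
            votes[y].append(yes)
            if not yes:
                flips.append((y, x))       # rules 7 and 8: reverse link y -> x

    for y, x in flips:
        out_nbrs[y].remove(x)
        in_nbrs[x].remove(y)
        out_nbrs[x].add(y)
        in_nbrs[y].add(x)


# Hypothetical 5-node network (not one of the book's figures).
adj = {1: {3, 4}, 2: {4}, 3: {1, 5}, 4: {1, 2, 5}, 5: {3, 4}}
out_nbrs, in_nbrs = orient(adj)
before = sources(in_nbrs)
yoyo_iteration(out_nbrs, in_nbrs)
after = sources(in_nbrs)
print(before, after)
```

In this run, node 4 receives the values 1 and 2, votes NO to source 2, and the link between them is flipped, so source 2 becomes a sink and only source 1 remains.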


As a result, any source that receives a NO will cease to be a source; it can actually become a sink. Some sinks may cease to be such and become internal nodes, and some internal nodes might become sinks. However, no sink or internal node will ever become a source (Exercise 3.10.83). A new DAG is thus created, where the sources are only those that received all YES in this iteration (see Figure 3.55(b)). Once a node has completed its part in the -YO phase, it will know whether it is a source, a sink, or an internal node in the new DAG. The next iteration could start now, initiated by the sources of the new DAG.

Property 3.8.14 Applying an iteration to a DAG with more than one source will result in a DAG with fewer sources. The source with the smallest value will still be a source.

In each iteration, some sources (at least one) will cease to be sources; in contrast, the source with the smallest value will eventually be the only one left under consideration. In other words, eventually the DAG will have a single source (the overall minimum, say c), and all other nodes will be either sinks or internal nodes.

How can c determine that it is the only source left, and thus that it should become the leader? If we were to perform an iteration now, only c’s value would be sent in the YO- phase, and only YES votes would be sent in the -YO phase. The source c would receive only YES votes; but c has received only YES votes in every iteration it has performed (that is why it survived as a source). How can c tell that this time is different, and that the process should end?

Clearly, we need some additional mechanisms during the iterations. We are going to add some meta-rules, called Pruning, which will allow us to reduce the number of messages sent during the iterations, as well as to ensure that termination is detected when only one source is left.
Pruning The purpose of pruning is to remove from the computation the nodes and links that are “useless,” that is, that do not have any impact on the result of the iteration; in other words, if they were not there, the same result would still be obtained: The same sources would stay sources, and the others would be defeated. Once a link or a node is declared “useless,” during the next iterations it will be considered nonexistent and, thus, not used. Pruning is achieved through two meta-rules.

The first meta-rule is a structural one. To explain it, recall that the function of the sinks is to reduce the number of sources by voting on the received values. Consider now a sink that is a leaf (i.e., it has only one in-neighbor); such a node will receive only one value; thus it can only vote YES. In other words, a sink leaf can only agree with the choice (i.e., the decision) made by its parent (i.e., its only neighbor). Thus, a sink leaf is “useless.”

9. If a sink is a leaf (i.e., it has only one in-neighbor), then it is useless; it then asks its parent to be pruned. If a node is asked to prune an out-neighbor, it will do so by declaring useless (i.e., removing from consideration in the next iterations) the connecting link.


FIGURE 3.56: Rules of pruning.

Notice that after pruning a link, a node might become a sink; if it is also a leaf, then it becomes useless.

The other meta-rule is geared toward reducing the communication of redundant information. During the YO- phase, an (internal or sink) node might receive the value of the same source from more than one in-neighbor; this information is clearly redundant as, to do its job (choose the minimum received value), it is enough for the node to receive just one copy of that value. Let x receive the value of source s from in-neighbors $x_1, \ldots, x_k$, k > 1. This means that in the DAG, there are directed paths from s to (at least) k distinct in-neighbors of x. This also means that if the link between x and one of them, say $x_1$, did not exist, the value from s would still arrive at x from the other neighbors, $x_2, \ldots, x_k$. In fact, if we had removed the links between x and all those in-neighbors except one, x would still have received the value of s from that neighbor. In other words, the links between x and $x_1, \ldots, x_k$ are redundant: It is sufficient to keep one; all others are useless and can be pruned. Notice that the choice regarding the link that should be kept is irrelevant.

10. If in the YO- phase a node receives the same value from more than one in-neighbor, it will ask all of them except one to prune the link connecting them, and it will declare those links useless. If a node receives such a request, it will declare useless (i.e., remove from consideration in the next iterations) the connecting link.

Notice that after pruning a link because of rule (10), a sink might become a leaf and thus useless (by rule (9)) (see Figure 3.57).
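The local decision behind rules (9) and (10) can be sketched for a single node as follows. This is a hedged illustration, not the protocol itself: the function name `prune_requests` and its arguments are hypothetical, the kept copy is chosen arbitrarily (the text notes that the choice is irrelevant), and the sketch assumes that rule (9) is applied right after rule (10) within the same iteration. Votes and link flips are not modeled here.

```python
def prune_requests(received, has_out_neighbors):
    """Return the in-neighbors that one node asks to prune their link to it.

    received maps each in-neighbor to the value it sent in the YO- phase.
    has_out_neighbors tells whether the node still has useful out-links.
    """
    kept = {}          # value -> the single in-neighbor kept for that value
    to_prune = set()
    for y in sorted(received):
        v = received[y]
        if v in kept:
            to_prune.add(y)        # rule (10): a redundant copy of v
        else:
            kept[v] = y
    remaining = set(received) - to_prune
    if not has_out_neighbors and len(remaining) == 1:
        to_prune |= remaining      # rule (9): the node is now a sink leaf
    return to_prune


# A sink that hears value 1 from both in-neighbors 3 and 4 ends up useless.
print(prune_requests({3: 1, 4: 1}, has_out_neighbors=False))
# An internal node in the same situation only drops the redundant link.
print(prune_requests({3: 1, 4: 1}, has_out_neighbors=True))
```

The first call illustrates the remark above: after rule (10) removes one of the two links, the sink is left with a single in-neighbor and rule (9) removes that link as well.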


FIGURE 3.57: The effects of pruning in the ﬁrst iteration: Some nodes (in black) and links are removed from consideration.

The pruning rules require communication: In rule (9), a sink leaf needs to ask its only neighbor to declare the link between them useless; in rule (10), a node receiving redundant information needs to ask some of its neighbors to prune the connecting link. We will have this communication take place during the -YO phase: The message containing the vote will also include the request, if any, to declare that link useless. In other words, pruning is performed when voting.

Let us now return to our concern of how to detect termination. As we will see, the pruning operations, integrated in the -YO phase, will do the trick. To understand how and why, consider the effect of performing a full iteration (with pruning) on a DAG with only one source.


FIGURE 3.58: The effects of pruning in the second iteration: Other nodes (in black) and links are removed from consideration.

Property 3.8.15 If the DAG has a single source, then, after an iteration, the new DAG is composed of only one node, the source.

In other words, when there is a single source c, all other nodes will be removed, and c will be the only useful node left. This situation will be discovered by c when, because of pruning, it will have no neighbors (Figure 3.59).

Costs The general formula expressing the costs of protocol YO-YO is easy to establish; however, the exact determination of the costs expressed by the formula is still an open research problem. Let us derive the general formula.

In the Setup phase, each node sends its value to all its neighbors; hence, on each link there will be two messages sent, for a total of 2m messages.


FIGURE 3.59: The effects of pruning in the third iteration: Termination is detected as the source has no more neighbors in the DAG.

Consider now an iteration. In the YO- stage, every useful node (except the sinks) sends a message to its out-neighbors; hence, on each link still under consideration, there will be exactly one message sent. Similarly, in the -YO stage, every useful node (except the sources) sends a message to its in-neighbors; hence, on each link there will be again only one message sent. Thus, in total in iteration i there will be exactly $2m_i$ messages, where $m_i$ is the number of links in the DAG used at stage i. The notification of termination from the leader can be performed by broadcasting on the constructed spanning tree with only n − 1 messages. Hence, the total cost will be

$$2 \sum_{i=0}^{k(G)} m_i + n - 1,$$

where $m_0 = m$ and k(G) is the total number of iterations on network G.

We need now to establish the number of iterations k(G). Let $D(1) = \vec{G}$ be the original DAG obtained from G as a result of setup. Let G(1) be the undirected graph defined as follows: There is a node for each source in D(1), and there is a link between two nodes if and only if the two corresponding sources have a sink in common (in a DAG, two sources a and b are said to have a common sink c if c is reachable from both a and b). Consider now the diameter of this graph, written d(G(1)) or diam(G(1)).

Property 3.8.16 The number of iterations is at most $\lceil \log \mathrm{diam}(G(1)) \rceil + 1$.

To see why this is the case, consider any two neighbors a and b in G(1). As, by definition, the corresponding sources in D(1) have a common sink, at least one of these two sources will be defeated (because the sink will vote YES to only one of them). This means that if we take any path in G(1), at least half of the nodes on that path will correspond to sources that will cease to be such at the end of this iteration. Furthermore, if (the source corresponding to) a survives, it will now have a sink in common with each of the undefeated (sources corresponding to) neighbors of b. This means that if we consider the new DAG D(2), the corresponding graph G(2) is exactly the graph obtained by removing the nodes associated with the defeated sources and linking together the nodes previously at distance two. In other words,

$$d(G(2)) \le \lceil d(G(1))/2 \rceil.$$

Similar will be the relationship between the graphs G(i − 1) and G(i) corresponding to the DAG D(i − 1) of iteration i − 1 and to the resulting new DAG D(i), respectively. In other words, $d(G(i)) \le \lceil d(G(i-1))/2 \rceil$. Observe that d(G(i)) = 1 corresponds to a situation where all sources except one will be defeated in this iteration, and d(G(i)) = 0 corresponds to the situation where there is only one source left (which does not know it yet). As d(G(i)) ≤ 1 after at most $\lceil \log d(G(1)) \rceil$ iterations, the property follows.

As the diameter of a graph cannot be greater than the number of its nodes, and as the nodes of G(1) correspond to the sources of $\vec{G}$ (whose number we write $s(\vec{G})$),

$$k(G) \le \lceil \log s(\vec{G}) \rceil \le \lceil \log n \rceil.$$

We can thus establish that without pruning, that is, with $m_i = m$, we have an O(m log n) total cost

$$M[\text{YO-YO (without pruning)}] \le 2\, m \log n + \text{l.o.t.} \qquad (3.42)$$
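The quantities used in this analysis can be computed on a small example. The sketch below is an illustration under several assumptions: all function names are hypothetical, log is read as the base-2 logarithm, G(1) is assumed connected, and a diameter of 0 (single source) is treated like a diameter of 1. It builds the source graph G(1) from a DAG, measures its diameter, and evaluates both the iteration bound of Property 3.8.16 and the general cost formula.

```python
import math
from collections import deque


def reachable_sinks(out_nbrs, s):
    """Sinks reachable from s along directed paths (depth-first search)."""
    seen, stack, sinks = {s}, [s], set()
    while stack:
        x = stack.pop()
        if not out_nbrs[x]:
            sinks.add(x)
        for y in out_nbrs[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return sinks


def source_graph(out_nbrs):
    """G(1): one node per source, linked when two sources share a sink."""
    targets = {y for nbrs in out_nbrs.values() for y in nbrs}
    srcs = [x for x in out_nbrs if x not in targets]
    sinks = {s: reachable_sinks(out_nbrs, s) for s in srcs}
    return {a: {b for b in srcs if b != a and sinks[a] & sinks[b]} for a in srcs}


def diameter(graph):
    """Largest breadth-first distance; assumes the graph is connected."""
    best = 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in graph[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        best = max(best, max(dist.values()))
    return best


def iteration_bound(d):
    """ceil(log2 d) + 1; d = 0 is treated like d = 1 (an assumption)."""
    return math.ceil(math.log2(max(d, 1))) + 1


def total_messages(links_per_stage, n):
    """2 * sum of m_i (Setup counted as stage 0) plus n - 1 for notification."""
    return 2 * sum(links_per_stage) + (n - 1)


# Hypothetical DAG with n = 5 nodes and m = 5 links; sources 1 and 2, sink 5.
dag = {1: {3, 4}, 2: {4}, 3: {5}, 4: {5}, 5: set()}
g1 = source_graph(dag)
print(g1, diameter(g1), iteration_bound(diameter(g1)))
print(total_messages([5, 5], n=5))   # Setup plus one iteration, no pruning
```

Here the two sources share sink 5, so G(1) is a single link of diameter 1 and the bound gives one iteration; with $m_0 = m_1 = 5$ the formula yields 2(5 + 5) + 4 = 24 messages.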

The unsolved problem is the determination of the real cost of the algorithm when the effects of pruning are taken into account.

3.8.4 Lower Bounds and Equivalences

We have seen a complex but rather efficient protocol, MegaMerger, for electing a leader in an arbitrary network. In fact, it uses O(m + n log n) messages in the worst case. This means that in a ring network it uses O(n log n) messages and is thus optimal, without even knowing that the network is a ring.

The next question we should ask is how efficient a universal election protocol can be. In other words, what is the complexity of the election problem? The answer is not difficult to derive. First of all, observe that any election protocol requires a message to be sent on every link.

To see why this is true, assume by contradiction that there is a correct universal election protocol A that, in some network G and in some execution under IR, does not send a message on every link of G. Consider such a network G and such an execution of A in G; let z be the entity that becomes leader and let e = (x, y) ∈ E be a link where no message is transmitted by A (Figure 3.60(a)).


FIGURE 3.60: Every universal election protocol must send messages on every link.

We will now construct a new graph H as follows: We make two copies of G and remove from both of them the edge e; we then connect these two graphs G′ and G″ by adding two new edges $e_1 = (x', x'')$ and $e_2 = (y', y'')$, where x′ and x″ (respectively, y′ and y″) are the copies of x (respectively, y) in G′ and G″, respectively, and where the labels are $l_{x'}(e_1) = l_{x''}(e_1) = l_x(e)$ and $l_{y'}(e_2) = l_{y''}(e_2) = l_y(e)$ (see Figure 3.60(b)).

Run exactly the same execution of A we did in G on the two components G′ and G″ of H: As no message was sent along (x, y) in G, this is possible. Moreover, as no message was sent along (x, y) in the original execution, x′ and x″ will never send messages to each other in the current execution; similarly, y′ and y″ will never send messages to each other. This means that the entities of G′ will never communicate with the entities of G″ during this execution; thus, they will not be aware of their existence and will operate solely within G′; similarly for the entities of G″. This means that when the execution of A in G′ terminates, entity z′ will become leader; but similarly, entity z″ in G″ will become leader as well. In other words, two leaders will be elected, contradicting the correctness of protocol A. Hence,

$$M(\text{Elect}/\text{IR}) \ge m.$$

This lower bound is powerful enough to provide us with interesting and useful information; for example, it states that $\Omega(n^2)$ messages are needed in a complete graph if you do not know that it is a complete graph. By contrast, we know that there are networks where election requires far more than m messages; for example, in rings m = n but we need $\Omega(n \log n)$ messages. As a universal election protocol must run in every network, including rings, we can say that in the worst case,

$$M(\text{Elect}/\text{IR}) \ge \Omega(m + n \log n). \qquad (3.43)$$
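The construction of H used in the lower-bound argument can be sketched as follows. This is only an illustration: the function names `build_H` and `degrees`, the edge-list representation, and the example graph are hypothetical, and port labels are not modeled (only the number of links at each node is compared).

```python
from collections import Counter


def build_H(edges, e):
    """Two copies of G without edge e = (x, y), joined by e1 and e2."""
    x, y = e
    rest = [f for f in edges if set(f) != {x, y}]
    # Copy c of node v is represented by the pair (v, c), with c in {1, 2}.
    h_edges = [((u, c), (v, c)) for c in (1, 2) for (u, v) in rest]
    h_edges.append(((x, 1), (x, 2)))   # e1 joins the two copies of x
    h_edges.append(((y, 1), (y, 2)))   # e2 joins the two copies of y
    return h_edges


def degrees(edges):
    """Number of incident links of every node in an edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


# Hypothetical network G with m = 4 links; no message is assumed to cross (1, 2).
g_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
h_edges = build_H(g_edges, (1, 2))
deg_g, deg_h = degrees(g_edges), degrees(h_edges)
same_local_view = all(deg_h[(v, c)] == deg_g[v] for v in deg_g for c in (1, 2))
print(len(h_edges), same_local_view)
```

Every copy of a node has in H exactly as many links as the original node has in G, which is why the entities cannot locally distinguish an execution in H from the original execution in G.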


This means that protocol MegaMerger is worst-case optimal, and we know the complexity of the election problem.

Property 3.8.17 The message complexity of election under IR is $\Theta(m + n \log n)$.

We are now going to see that constructing a spanning tree (SPT) and electing a leader (Elect) are strictly equivalent: Any solution to one of them can be easily modified so as to solve the other with the same message cost (in order of magnitude). First of all, observe that, similarly to the Election problem, SPT also requires a message to be sent on every link (Exercise 3.10.85):

$$M(\text{SPT}/\text{IR}) \ge m. \qquad (3.44)$$

We are now going to see how we can construct a spanning-tree construction algorithm from any existing election protocol. Let A be an election protocol; consider now the following protocol B:

1. Elect a leader using A.
2. The leader starts the execution of protocol Shout.

Recall that protocol Shout (seen in Section 2.5) will correctly construct a spanning tree if there is a unique initiator. As the leader elected in step (1) is unique, a spanning tree will be constructed in step (2). So, protocol B solves SPT. What is the cost? As Shout uses exactly 2m messages, we have M[B] = M[A] + 2m. In other words, with at most O(m) additional messages, any election protocol can be made to construct a spanning tree; as $\Omega(m)$ messages are needed anyway (Equation 3.44), this means that

$$M(\text{SPT}/\text{IR}) \le M(\text{Elect}/\text{IR}). \qquad (3.45)$$

Focus now on a spanning-tree construction algorithm C. Using C as the first step, it is easy to construct an election protocol D where (Exercise 3.10.86) M[D] = M[C] + O(n). In other words, the message complexity of Elect is no more than that of SPT plus at most another O(n) messages; as election requires more than O(n) messages anyway (Property 3.8.17), this means that

$$M(\text{Elect}/\text{IR}) \le M(\text{SPT}/\text{IR}). \qquad (3.46)$$


Combining Equations 3.45 and 3.46, we have not only that the problems are computationally equivalent,

$$\text{Elect}(\text{IR}) \equiv \text{SPT}(\text{IR}), \qquad (3.47)$$

but also that they have the same complexity:

$$M(\text{Elect}/\text{IR}) = M(\text{SPT}/\text{IR}). \qquad (3.48)$$

Using similar arguments, it is possible to establish the computational and complexity equivalence of election with several other problems (e.g., see Exercise 3.10.87).

3.9 BIBLIOGRAPHICAL NOTES

Election in a ring network is one of the first problems studied in distributed computing from an algorithmic point of view. The first solution protocol, All the Way, is due to Gerard Le Lann [29], who proposed it for unidirectional rings. Also for unidirectional rings, protocol AsFar was developed by Ernie Chang and Rosemary Roberts [12]; it was later analyzed experimentally by Friedemann Mattern [34] and analytically by Christian Lavault [31]. The probabilistic bidirectional version ProbAsFar was proposed and analyzed by Ephraim Korach, Doron Rotem, and Nicola Santoro [28]. Hans Bodlaender and Jan van Leeuwen later showed how to make it deterministic and provided further analysis [8]; the exact asymptotic average value has been derived by Christian Lavault [30].

The idea behind the first $\Theta(n \log n)$ worst-case protocol, Control, is due to Dan Hirschberg and J.B. Sinclair [22]. Protocol Stages was designed by Randolph Franklin [17]; the more efficient Stages with Feedback was developed by Ephraim Korach, Doron Rotem, and Nicola Santoro [27]. The first $\Theta(n \log n)$ worst-case protocol for unidirectional rings, UniStages, was designed by Danny Dolev, Maria Klawe, and Michael Rodeh [15]. The more efficient MinMax is due to Gary Peterson [39]. The even more efficient protocol MinMax+ has been designed by Lisa Higham and Theresa Przytycka [21]. Bidirectional versions of MinMax with the same complexity as the original (Problem 3.10.4) have been independently designed by Shlomo Moran, Mordechai Shalom, and Shmuel Zaks [35], and by Jan van Leeuwen and Richard Tan [44].

The lower bound for unidirectional rings is due to Jan Pachl, Doron Rotem, and Ephraim Korach [36]. James Burns developed the first lower bound for bidirectional rings [9]. The lower bounds when n is known (Exercises 3.10.45 and 3.10.47), as well as others, are due to Hans Bodlaender [5–7].
The O(n) election protocol for tori was designed by Gary Peterson [38] and later reﬁned for unoriented tori by Bernard Mans [33].


The quest for an O(n) election protocol for hypercubes with dimensional labelings was solved independently by Steven Robbins and Kay Robbins [40], Paola Flocchini and Bernard Mans [16], and Gerard Tel [43]. Stefan Dobrev [13] has designed a protocol that allows O(n) election in hypercubes with any sense of direction, not just the dimensional labeling (Exercise 3.10.63). The protocol for unoriented hypercubes has been designed by Stefan Dobrev and Peter Ruzicka [14].

The first optimal $\Theta(n \log n)$ protocol for complete networks was developed by Pierre Humblet [23]; an optimal protocol that requires O(n) messages on the average (Exercise 3.10.74) was developed by Mee Yee Chan and Francis Chin [10]. The lower bound is due to Ephraim Korach, Shlomo Moran, and Shmuel Zaks [26], who also designed another optimal protocol. The optimal protocol CompleteElect, reducing the O(n log n) time complexity to O(n), was designed by Yehuda Afek and Eli Gafni [2]; the same bounds were independently achieved by Gary Peterson [38]. The time complexity was later reduced to $O(n/\log n)$ without increasing the message costs (Exercise 3.10.68) by Gurdip Singh [42]. The fact that a chordal labeling allows one to fully exploit the communication power of the complete graph was observed by Michael Loui, Teresa Matsushita, and Douglas West, who developed the first O(n) protocol for such a case [32]. Stefan Dobrev [13] has designed a protocol that allows O(n) election in complete networks with any sense of direction, not just the chordal labeling (Exercise 3.10.75).

Election protocols for chordal rings, including the double cube, were designed and analyzed by Hagit Attiya, Jan van Leeuwen, Nicola Santoro, and Shmuel Zaks [3]. The quest for the smallest chord structure has seen k being reduced from O(log n) first to O(log log n) by T.Z. Kalamboukis and S.L. Mantzaris [24], then to O(log log log n) by Yi Pan [37], and finally to O(1) (Problem 3.10.12) by Andreas Fabri and Gerard Tel [unpublished].
The observation that in such a chordal ring, election can be done in O(n) messages even if the links are arbitrarily labeled (Problem 3.10.13) is due to Bernard Mans [33].

The first O(m + n log n) universal election protocol was designed by Robert Gallager [18]. Some of the ideas developed there were later used in MegaMerger, developed by Robert Gallager, Pierre Humblet, and Philip Spira, which actually constructs a min-cost spanning tree [19]. The O(n log n) time complexity of MegaMerger has been reduced first to $O(n \log^* n)$ by Mee Yee Chan and Francis Chin [11] and then to O(n) (Problem 3.10.14) by Baruch Awerbuch [4] without increasing the message complexity. It has been further reduced to $\Theta(d)$ (Problem 3.10.15) by Hosame Abu-Amara and Arkady Kanevsky but at the expense of an O(m log d) message cost [1]; the same reduction has been obtained independently by Juan A. Garay, Shay Kutten, and David Peleg [20].

Protocol YO-YO was designed by Nicola Santoro; the proof that it requires at most O(log n) stages is due to Gerard Tel. The computational relationship between the traversal and the election problems has been discussed and analyzed by Ephraim Korach, Shay Kutten, and Shlomo Moran [25]. The $\Omega(m + n \log n)$ lower bound for universal election, as well as some of the other computational equivalence relationships, were first observed by Nicola Santoro [41].


3.10 EXERCISES, PROBLEMS, AND ANSWERS

3.10.1 Exercises

Exercise 3.10.1 Modify protocol MinF-Tree (presented in Section 2.6.2) so as to implement strategy Elect Minimum Initiator in a tree. Prove its correctness and analyze its costs. Show that, in the worst case, it uses 3n + k − 4 ≤ 4n − 4 messages.

Exercise 3.10.2 Design an efficient single-initiator protocol to find the minimum value in a ring. Prove its correctness and analyze its costs.

Exercise 3.10.3 Show that the time costs of protocol All the Way will be at most 2n − 1. Determine also the minimum cost and the condition that will cause it.

Exercise 3.10.4 Modify protocol All the Way so as to use strategy Elect Minimum Initiator.

Exercise 3.10.5 Modify protocol AsFar so as to use strategy Elect Minimum Initiator. Determine the average number of messages assuming that any subset of $k^*$ entities is equally likely to be the initiators.

Exercise 3.10.6 Expand the rules of protocol Stages described in Section 3.3.4, so as to enforce message ordering.

Exercise 3.10.7 Show that in protocol Stages, there will be at most one enqueued message per closed port.

Exercise 3.10.8 Prove that in protocol Stages with Feedback, the minimum distance between two candidates in stage i is $d(i) \ge 2^{i-1}$.

Exercise 3.10.9 Show an initial configuration for n = 8 in which protocol Stages will require the most messages. Describe how to construct the “worst configuration” for any n.

Exercise 3.10.10 Determine the ideal time complexity of protocol Stages.

Exercise 3.10.11 Modify protocol Stages using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.12 Write the rules of protocol Stages* described in Section 3.3.4.

Exercise 3.10.13 Assume that in Stages* candidate x in stage i receives a message $M^*$ with stage j > i. Prove that if x survives, then id(x) is smaller not only than $id^*$ but also than the ids in the messages “jumped over” by $M^*$.

Exercise 3.10.14 Show that protocol Stages* correctly terminates.


Exercise 3.10.15 Prove that the message and time costs of Stages* are no worse than those of Stages. Produce an example in which the costs of Stages* are actually smaller.

Exercise 3.10.16 Write the rules of protocol Stages with Feedback assuming message ordering.

Exercise 3.10.17 Derive the ideal time complexity of protocol Stages with Feedback.

Exercise 3.10.18 Write the rules of protocol Stages with Feedback enforcing message ordering.

Exercise 3.10.19 Prove that in protocol Stages with Feedback, the number of ring segments where no feedback will be transmitted in stage i is $n_{i+1}$.

Exercise 3.10.20 Prove that in protocol Stages with Feedback, the minimum distance between two candidates in stage i is $d(i) \ge 3^{i-1}$.

Exercise 3.10.21 Give a more accurate estimate of the message costs of protocol Stages with Feedback.

Exercise 3.10.22 Show an initial configuration for n = 9 in which protocol Stages with Feedback will require the most stages. Describe how to construct the “worst configuration” for any n.

Exercise 3.10.23 Modify protocol Stages with Feedback using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.24 Implement the alternating step strategy under the same restrictions and with the same cost as protocol Alternate but without closing any port.

Exercise 3.10.25 Determine initial configurations that will force protocol Alternate to use k steps when $n = F_k$.

Exercise 3.10.26 Show that the worst-case number of steps of protocol Alternate is achievable for every n > 4.

Exercise 3.10.27 Determine the ideal time complexity of protocol Alternate.

Exercise 3.10.28 Modify protocol Alternate using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.29 Show the step-by-step execution of Stages and of UniStages in the ring of Figure 3.3. Indicate, for each step, the values known at the candidates.


Exercise 3.10.30 Determine the ideal time complexity of protocol UniStages.

Exercise 3.10.31 Modify protocol UniStages using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.32 Design an exact simulation of Stages with Feedback for unidirectional rings. Analyze its costs.

Exercise 3.10.33 Show the step-by-step execution of Alternate and of UniAlternate in the ring of Figure 3.3. Indicate, for each step, the values known at the candidates.

Exercise 3.10.34 Without changing its message cost, modify protocol UniAlternate so that it does not require Message Ordering.

Exercise 3.10.35 Prove that the ideal time complexity of protocol UniAlternate is O(n).

Exercise 3.10.36 Modify protocol UniAlternate using the min-max approach discussed in Section 3.3.7. Prove its correctness. Show that its message costs are unchanged.

Exercise 3.10.37 Prove that in protocol MinMax, if a candidate x survives an even stage i, its predecessor l(i, x) becomes defeated.

Exercise 3.10.38 Show that the worst-case number of steps of protocol MinMax is achievable.

Exercise 3.10.39 Modify protocol MinMax so that it does not require Message Ordering. Implement your modification and thoroughly test your implementation.

Exercise 3.10.40 For protocol MinMax, consider the configuration depicted in Figure 3.32. Prove that once envelope (11, 3) reaches the defeated node z, z can determine that 11 will survive this stage.

Exercise 3.10.41 Write the rules of Protocol MinMax+ assuming message ordering.

Exercise 3.10.42 Write the rules of Protocol MinMax+ without assuming message ordering.

Exercise 3.10.43 Prove Property 3.3.1.

Exercise 3.10.44 Prove that in protocol MinMax+, if an envelope with value v reaches an even stage i + 1, it saves at least $F_i$ messages in stage i with respect to MinMax (Hint: Use Property 3.3.1).


Exercise 3.10.45 Prove that even if the entities know n, $\mathrm{ave}_A(I \mid n \text{ known}) \ge (\frac{1}{4} - \epsilon)\, n \log n$ for any election protocol A for unidirectional rings.

Exercise 3.10.46 Prove that in bidirectional rings, $\mathrm{ave}_A(I) \ge \frac{1}{2}\, n H_n$ for any election protocol A.

Exercise 3.10.47 Prove that even if the entities know n, $\mathrm{ave}_A(I \mid n \text{ known}) \ge \frac{1}{2}\, n \log n$ for any election protocol A for unidirectional rings.

Exercise 3.10.48 Determine the exact complexity of Wake-Up in a mesh of dimensions a × b.

Exercise 3.10.49 Show how to broadcast from a corner of a mesh of dimensions a × b with fewer than 2n messages.

Exercise 3.10.50 In Protocol ElectMesh, in the first stage of the election process, if an interior node receives an election message, it will reply to the sender “I am in the interior,” so that no subsequent election messages are sent to it. Explain why it is possible to achieve the same goal without sending those replies.

Exercise 3.10.51 Consider the following simple modification to Protocol ElectMesh: When sending a wake-up message, a node includes the information of whether it is an internal, a border, or a corner node. Then, during the first stage of the election, a border node uses this information, if possible, to send the election message only along the outer ring (it might not be possible). Show that the protocol so modified uses at most 4(a + b) + 5n + k − 32 messages.

Exercise 3.10.52 Broadcasting in Oriented Mesh. Design a protocol that allows one to broadcast in an oriented mesh using n − 1 messages regardless of the location of the initiator.

Exercise 3.10.53 Traversal in Oriented Mesh. Design a protocol that allows one to traverse an oriented mesh using n − 1 messages regardless of the location of the initiator.

Exercise 3.10.54 Wake-Up in Oriented Mesh. Design a protocol that allows one to wake up all the entities in an oriented mesh using fewer than 2n messages regardless of the location and the number of the initiators.

Exercise 3.10.55 Show that the effect of rounding up $\alpha^i$ does not affect the order of magnitude of the cost of Protocol MarkBorder derived in Section 3.4.2 (Hint: Show that it amounts to at most eight extra messages per candidate per stage with an insignificant change in the bound on the number of candidates in each stage).


Exercise 3.10.56 Show that the ideal time of protocol MarkBorder can be as bad as O(n).

Exercise 3.10.57 Improving Time in Tori () Modify Protocol MarkBorder so that the time complexity is $O(\sqrt{n})$ without increasing the message complexity. Ensure that the modified protocol is correct.

Exercise 3.10.58 Election in Rectangular Torus () Modify Protocol MarkBorder so that it elects a leader in a rectangular torus of dimension l × w (l ≤ w), using Θ(n + l log l/w) messages.

Exercise 3.10.59 Determine the cost of electing a leader in an oriented hypercube if in protocol HyperElect the propagation of the Match messages is done by broadcasting in the appropriate subcube instead of “compressing the address.”

Exercise 3.10.60 Prove that in protocol HyperElect the distance d(j − 1, j) between $w_{j-1}(z)$ and $w_j(z)$ is at most j.

Exercise 3.10.61 Prove Lemma 3.5.1, that is, that during the execution of protocol HyperElect, the only duelists in stage i are the entities with the smallest id in one of the hypercubes of dimension i − 1 in $H_{k:i-1}$.

Exercise 3.10.62 Show that the time complexity of Protocol HyperFlood is $O(\log^3 N)$.

Exercise 3.10.63 () Prove that it is possible to elect a leader in a hypercube using O(n) messages with any sense of direction (Hint: Use long messages).

Exercise 3.10.64 Prove that in the strategy CompleteElect outlined in Section 3.6.1, the territories of any two candidates in the same stage have no nodes in common.

Exercise 3.10.65 Prove that the strategy CompleteElect outlined in Section 3.6.1 solves the election problem.

Exercise 3.10.66 Determine the cost of the strategy CompleteElect described in Section 3.6.1 in the worst case (Hint: Consider how many candidates there can be at level i).

Exercise 3.10.67 Analyze the ideal time cost of protocol CompleteElect described in Section 3.6.1.

Exercise 3.10.68 Design an election protocol for complete graphs that, like CompleteElect, uses O(n log n) messages but uses only O(n/log n) time in the worst case.


Exercise 3.10.69 Generalize the answer to Exercise 3.10.68. Design an election protocol for complete graphs that, for any log n ≤ k ≤ n, uses O(nk) messages and O(n/k) time in the worst case. Exercise 3.10.70 Prove that all the rings R(2), . . . , R(k) where messages are sent by protocol Kelect do not have links in common. Exercise 3.10.71 Write the code for, implement, and test protocol Kelect-Stages. Exercise 3.10.72 () Consider using the ring protocol Alternate instead of Stages in Kelect. Determine what will be the cost in this case. Exercise 3.10.73 () Determine the average message costs of protocol Kelect-Stages.

Exercise 3.10.74 () Show how to elect a leader in a complete network with O(n log n) messages in the worst case but only O(n) on the average. Exercise 3.10.75 () Prove that it is possible to elect a leader in a complete graph using O(n) messages with any sense of direction. Exercise 3.10.76 Show how to elect a leader in the chordal ring $C_n\langle 1, 2, 3, 4, \ldots, t\rangle$ with $O\left(n + \frac{n}{t} \log \frac{n}{t}\right)$ messages. Exercise 3.10.77 Prove that in the chordal ring $C_n^t$ electing a leader requires at least $\Omega\left(n + \frac{n}{t} \log \frac{n}{t}\right)$ messages in the worst case (Hint: Reduce the problem to that of electing a leader on a ring of size n/t). Exercise 3.10.78 Show how to elect a leader in the double cube $C_n\langle 1, 2, 4, 8, \ldots, 2^{\log n}\rangle$ with O(n) messages. Exercise 3.10.79 Consider a merger message from city A arriving at neighbouring city B along merge link (a, b) in protocol Mega-Merger. Prove that if we reverse the logical direction of the links on the path from D(A) to the exit point a and direct the merge link toward B, the union of A and B will be rooted in the downtown of A. Exercise 3.10.80 District b of B has just received a Let-us-Merge message from a along merge link (a, b). From the message, b finds out that level(A) > level(B); thus, it postpones the request. In the meanwhile, the downtown D(B) chooses (a, b) as its merge link. Explain why this situation will never occur. Exercise 3.10.81 Find a way to avoid notification of termination by the downtown of the megacity in protocol Mega-Merger (Hint: Show that by the time the downtown understands that the mega-merger is completed, all other districts already know that their execution of the protocol is terminated).


Exercise 3.10.82 Time Costs. Show that protocol Mega-Merger uses at most O(n log n) ideal time units. Exercise 3.10.83 Prove that in the YO-YO protocol, during an iteration, no sink or internal node will become a source. Exercise 3.10.84 Modify the YO-YO protocol so that upon termination, a spanning tree rooted in the leader has been constructed. Achieve this goal without any additional messages. Exercise 3.10.85 Prove that to solve SPT under IR, a message must be sent on every link.

Exercise 3.10.86 Show how to transform a spanning-tree construction algorithm C so as to elect a leader with at most O(n) additional messages. Exercise 3.10.87 Prove that under IR, the problem of finding the smallest of the entities’ values is computationally equivalent to electing a leader and has the same message complexity.

3.10.2 Problems

Problem 3.10.1 Josephus Problem. Consider the following set of electoral rules. In stage i, a candidate x sends its id and receives the id from its two neighboring candidates, r(i, x) and l(i, x): x does not survive this stage if and only if its id is larger than both received ids. Analyze the corresponding protocol Josephus, determining in particular the number of stages and the total number of messages both in the worst and in the average case. Analyze and discuss its time complexity. Problem 3.10.2 Alternating Steps () Design a conflict resolution mechanism for the alternating steps strategy to cope with the lack of orientation in the ring. Analyze the complexity of the resulting protocol. Problem 3.10.3 Better Stages () Construct a protocol based on electoral stages that guarantees $n_i \le n_{i-1}/b$ with cn messages transmitted in each stage, where $c/\log b < 1.89$. Problem 3.10.4 Bidirectional MinMax () Design a bidirectional version of MinMax with the same costs. Problem 3.10.5 Distances in MinMax+ () In computing the cost of protocol MinMax+ we have used dis(i) = $F_{i+2}$. Determine what will be the cost if we use dis(i) = $2^i$ instead.


Problem 3.10.6 MinMax+ Variations () In protocol MinMax+ we use “promotion by distance” only in the even stages and “promotion by witness” only in the odd stages. Determine what would happen if we use
1. only “promotion by distance” but in every stage;
2. only “promotion by witness” but in every stage;
3. “promotion by distance” in every stage and “promotion by witness” only in odd stages;
4. “promotion by witness” in every stage and “promotion by distance” only in even stages;
5. both “promotion by distance” and “promotion by witness” in every stage.
Problem 3.10.7 Bidirectional Oriented Rings. () Prove or disprove that there is an efficient protocol for bidirectional oriented rings that can be neither used nor simulated, with the same or better costs, in unidirectional rings or in general bidirectional ones. Problem 3.10.8 Unoriented Hypercubes. () Design a protocol that can elect a leader in a hypercube with arbitrary labelling using O(n log log n) messages. Implement and test your protocol. Problem 3.10.9 Linear Election in Hypercubes. () Prove or disprove that it is possible to elect a leader in a hypercube in O(n) messages even when it is not oriented. Problem 3.10.10 Oriented Cube-Connected Cycles () Design an election protocol for an oriented CCC using O(n) messages. Implement and test your protocol. Problem 3.10.11 Oriented Butterfly. Design an election protocol for an oriented butterfly. Determine its complexity. Implement and test your protocol. Problem 3.10.12 Minimal Chordal Ring () Find a chordal ring with k = 2 where it is possible to elect a leader with O(n) messages. Problem 3.10.13 Unlabelled Chordal Rings () Show how to elect a leader in the chordal ring of Problem 3.10.12 with O(n) messages even if the edges are arbitrarily labeled. Problem 3.10.14 Improved Time () Show how to elect a leader using O(m + n log n) messages but only O(n) ideal time units.
Problem 3.10.15 Optimal Time () Show how to elect a leader in O(d) time using at most O(m log d) messages.


3.10.3 Answers to Exercises

Answer to Exercise 3.10.21 The size of the areas where no feedback is sent in stage i can vary from one another, from stage to stage, and from execution to execution. We can still estimate their size. In fact, the distance d(i) between two candidates in stage i is $d(i) \ge 3^{i-1}$ (Exercise 3.10.20). Thus, the total number of message transmissions caused in stage i by the feedback will be at most $n - n_{i+1} 3^{i-1}$, yielding a total of at most $\sum_{i=1}^{\log_3 n} \left(3n - n_{i+1} 3^{i-1}\right)$ messages.

Answer to Exercise 3.10.44 Let $h_j(a)$ denote the candidate that originated message (a, j). Consider a message (v, i + 1) and its originator $z = h_{i+1}(v)$; this message was sent after receiving (v, i) originated by $x = h_i(v)$. Let $y = h_i(u)$ be the first candidate after x in the ring in stage i, and (u, i) the message it originated. As v survives this stage, which is odd (i.e., min), it must be that v < u. Message (v, i) travels from x toward y; upon receiving (v, i), node z in this interval will generate (v, i + 1). Now z cannot be after node $h_{i-1}(u)$ in the ring because, by rule (IV), $w = h_{i-1}(u)$ would immediately generate (v, i + 1) after receiving (v, i). In other words, either z = w or z is before w. Thus we save at least $d(z, y) \ge d(w, y) = d(h_{i-1}(u), h_i(u)) \ge F_i$, where the last inequality is by Property 3.3.1.

Partial Answer to Exercise 3.10.66 Consider a captured node y that receives an attack from a candidate $x_1$ in level i. According to the strategy, y will send a Warning to its owner z to inform it of this attack and wait for a reply; depending on the reply, it will notify $x_1$ of whether the attack was successful (the case in which y will be captured by $x_1$) or not. Assume now that, while waiting, y receives one attack after the other, say from candidates $x_2, \ldots, x_k$ in that order, all in the same level i. According to the strategy, y will issue a Warning to its owner z for each of them. Observe now that if $id(z) > id(x_1) > \ldots > id(x_k)$, each of these attacks will be successful, and y will in turn be captured by all those candidates.

BIBLIOGRAPHY

[1] H. Abu-Amara and A. Kanevsky. On the complexities of leader election algorithms. In 5th IEEE International Conference on Computing and Information, pages 202–206, Sudbury, May 1993.
[2] Y. Afek and E. Gafni. Time and message bounds for election in synchronous and asynchronous complete networks. SIAM Journal on Computing, 20(2):376–394, 1991.
[3] H. Attiya, J. van Leeuwen, N. Santoro, and S. Zaks. Efficient elections in chordal ring networks. Algorithmica, 4:437–446, 1989.
[4] B. Awerbuch. Optimal distributed algorithms for minimum weight spanning tree, counting, leader election, and related problems. In 19th Annual ACM Symposium on Theory of Computing, pages 230–240, New York City, May 1987.


[5] H.L. Bodlaender. A better lower bound for distributed leader finding in bidirectional, asynchronous rings of processors. Information Processing Letters, 27(6):287–290, 1988.
[6] H.L. Bodlaender. New lower bound techniques for distributed leader finding and other problems on rings of processors. Theoretical Computer Science, 81:237–256, 1991.
[7] H.L. Bodlaender. Some lower bound results for decentralized extrema-finding in rings of processors. Journal of Computer and System Sciences, 42(1):97–118, 1991.
[8] H.L. Bodlaender and J. van Leeuwen. New upperbounds for distributed extrema-finding in a ring of processors. In Proc. 1st International Workshop on Distributed Algorithms (WDAG 1), pages 504–512, Ottawa, Aug. 1985.
[9] J. Burns. A formal model for message passing systems. Technical Report UTR-91, Indiana University, 1981.
[10] M.Y. Chan and F.L.Y. Chin. Distributed election in complete networks. Distributed Computing, 3(1):19–22, 1988.
[11] M.Y. Chan and F.L.Y. Chin. Improving the time complexity of message-optimal distributed algorithms for minimum-weight spanning trees. SIAM Journal on Computing, 19(4):612–626, 1990.
[12] E.J.H. Chang and R. Roberts. An improved algorithm for decentralized extrema-finding in circular configurations of processes. Communications of the ACM, 22(5):281–283, May 1979.
[13] S. Dobrev. Leader election using any sense of direction. In 6th International Colloquium on Structural Information and Communication Complexity, pages 93–104, Lacanau, July 1999.
[14] S. Dobrev and P. Ruzicka. Linear broadcasting and O(n log log n) election in unoriented hypercubes. In 4th International Colloquium on Structural Information and Communication Complexity, pages 53–68, Ascona, July 1997.
[15] D. Dolev, M. Klawe, and M. Rodeh. An O(n log n) unidirectional algorithm for extrema-finding in a circle. Journal of Algorithms, 3:245–260, 1982.
[16] P. Flocchini and B. Mans. Optimal elections in labeled hypercubes. Journal of Parallel and Distributed Computing, 33(1):76–83, 1996.
[17] W.R. Franklin. On an improved algorithm for decentralized extrema-finding in a circular configuration of processes. Communications of the ACM, 25(5):336–337, May 1982.
[18] R.G. Gallager. Finding a leader in a network with O(e) + O(n log n) messages. Technical Report Internal Memo, M.I.T., 1979.
[19] R.G. Gallager, P.A. Humblet, and P.M. Spira. A distributed algorithm for minimum spanning tree. ACM Transactions on Programming Languages and Systems, 5(1):66–77, 1983.
[20] J.A. Garay, S. Kutten, and D. Peleg. A sublinear time distributed algorithm for minimum-weight spanning trees. SIAM Journal on Computing, 27(1):302–316, February 1998.
[21] L. Higham and T. Przytycka. A simple, efficient algorithm for maximum finding on rings. Information Processing Letters, 58:319–324, 1996.
[22] D.S. Hirschberg and J.B. Sinclair. Decentralized extrema finding in circular configurations of processors. Communications of the ACM, 23:627–628, 1980.
[23] P.A. Humblet. Selecting a leader in a clique in O(n log n) messages. In Proc. 23rd Conf. on Decision and Control, pages 1139–1140, Las Vegas, Dec. 1984.
[24] T.Z. Kalamboukis and S.L. Mantzaris. Towards optimal distributed election on chordal rings. Information Processing Letters, 38(5):265–270, 1991.


[25] E. Korach, S. Kutten, and S. Moran. A modular technique for the design of efficient distributed leader finding algorithms. ACM Transactions on Programming Languages and Systems, 12(1):84–101, January 1990.
[26] E. Korach, S. Moran, and S. Zaks. Optimal lower bounds for some distributed algorithms for a complete network of processors. Theoretical Computer Science, 64:125–132, 1989.
[27] E. Korach, D. Rotem, and N. Santoro. Distributed election in a circle without a global sense of orientation. International Journal of Computer Mathematics, 16:115–124, 1984.
[28] E. Korach, D. Rotem, and N. Santoro. Analysis of a distributed algorithm for extrema finding in a ring. Journal of Parallel and Distributed Computing, 4:575–591, 1987.
[29] G. Le Lann. Distributed systems: Toward a formal approach. In IFIP Conference on Information Processing, pages 155–160, 1977.
[30] C. Lavault. Average number of messages for distributed leader-finding in rings of processors. Information Processing Letters, 30(4):167–176, 1989.
[31] C. Lavault. Exact average message complexity values for distributed election on bidirectional rings of processors. Theoretical Computer Science, 73(1):61–79, 1990.
[32] M.C. Loui, T.A. Matsushita, and D.B. West. Election in complete networks with a sense of direction. Information Processing Letters, 22:185–187, 1986. See also Information Processing Letters, 28:327, 1988.
[33] B. Mans. Optimal distributed algorithms in unlabeled tori and chordal rings. Journal of Parallel and Distributed Computing, 46(1):80–90, 1997.
[34] F. Mattern. Message complexity of simple ring-based election algorithms: an empirical analysis. In 9th IEEE International Conference on Distributed Computing Systems, pages 94–100, 1989.
[35] S. Moran, M. Shalom, and S. Zaks. An 1.44...n log n algorithm for distributed leader finding in bidirectional rings of processors. Technical Report RC 11933, IBM Research Division, 1986.
[36] J. Pachl, D. Rotem, and E. Korach. Lower bounds for distributed maximum finding algorithms. Journal of the ACM, 31:905–917, 1984.
[37] Y. Pan. An improved election algorithm in chordal ring networks. International Journal of Computer Mathematics, 40(3-4):191–200, 1991.
[38] G.L. Peterson. Improved algorithms for elections in meshes and complete networks. Technical report, Georgia Institute of Technology, December 1986.
[39] G.L. Peterson. An O(n log n) unidirectional algorithm for the circular extrema problem. ACM Transactions on Programming Languages and Systems, 4(4):758–762, October 1982.
[40] S. Robbins and K.A. Robbins. Choosing a leader on a hypercube. In N. Rishe, S. Najathe, and D. Tal, editors, PARBASE-90, International Conference on Databases, Parallel Architectures and their Applications, pages 469–471, Miami Beach, 1990.
[41] N. Santoro. On the message complexity of distributed problems. Journal of Computing and Information Sciences, 13:131–147, 1984.
[42] G. Singh. Leader election in complete networks. SIAM Journal on Computing, 26(3):772–785, 1997.
[43] G. Tel. Linear election in oriented hypercubes. Parallel Processing Letters, 5:357–366, 1995.
[44] J. van Leeuwen and R.B. Tan. An improved upperbound for distributed election in bidirectional rings of processors. Distributed Computing, 2(3):149–160, 1987.

CHAPTER 4

Message Routing and Shortest Paths

4.1 INTRODUCTION

Communication is at the base of computing in a distributed environment, but achieving it efficiently is neither simple nor trivial. Consider an entity x that wants to communicate some information to another entity y; for example, x has a message that it wants to be delivered to y. In general, x does not know where y is or how to reach it (i.e., which paths lead to it); actually, it might not even know if y is a neighbor or not. Still, the communication is always possible if the network $\vec{G}$ is strongly connected. In fact, it is sufficient for x to broadcast the information: every entity, including y, will receive it. This simple solution, called broadcast routing, is obviously not efficient; on the contrary, it is impractical, expensive in terms of cost, and not very secure (too many other nodes receive the message), even if it is performed only on a spanning tree of the network.

A more efficient approach is to choose a single path in $\vec{G}$ from x to y: The message sent by x will travel along this path only, relayed by the entities in the path, until it reaches its destination y. The process of determining a path between a source x and a destination y is known as routing. If there is more than one path from x to y, we would obviously like to choose the “best” one, that is, the least expensive one. The cost θ(a, b) ≥ 0 of a link (a, b), traditionally called length, is a value that depends on the system (reflecting, e.g., time delay, transmission cost, link reliability, etc.), and the cost of a path is the sum of the costs of the links composing it. The path of minimum cost is called the shortest path; clearly, the objective is to use this path for sending the message. The process of determining the most economic path between a source and a destination is known as shortest-path routing.
The (shortest-path) routing problem is commonly solved by storing at each entity x the information that allows it to address a message to its destination through a (shortest) path. This information is called the routing table. In this chapter we will discuss several aspects of the routing problem. First of all, we will consider the construction of the routing tables. We will then address the problem of keeping the information in the tables up to date, should changes occur in the system. Finally, we will discuss how to represent routing information in a compact way, suitable for systems where space is a problem. In the following, and unless otherwise specified, we will assume the set of restrictions IR: Bidirectional Links (BL), Connectivity (CN), Total Reliability (TR), and Initial Distinct Values (ID).

[Figure 4.1, panels (a) and (b): a weighted network on the entities s, h, k, c, d, e, f.]

FIGURE 4.1: Determining the shortest paths from s to the other entities.

4.2 SHORTEST PATH ROUTING

The routing table of an entity contains information on how to reach any possible destination. In this section we examine how this information can be acquired and the table constructed. As we will see, this problem is related to the construction of particular spanning trees of the network. In the following, and unless otherwise specified, we will focus on shortest-path routing. Different types of routing tables can be defined, depending on the amount of information contained in them. We will consider for now the full routing table: For each destination, a shortest path to reach it is stored; if there is more than one shortest path, only the lexicographically smallest¹ will be stored. For example, in the network of Figure 4.1, the routing table RT(s) for s is shown in Table 4.1. We will see different approaches to construct routing tables, some depending on the amount of local storage an entity has available.

4.2.1 Gossiping the Network Maps

A first obvious solution would be to construct at every entity the entire map of the network with all the costs; then, each entity can locally and directly compute its shortest-path routing table. This solution obviously requires that the local memory available to an entity is large enough to store the entire map of the network.

¹ The lexicographic order will be over the strings of the names of the nodes in the paths.


TABLE 4.1: Full Routing Table for Node s

Routing Destination    Shortest Path      Cost
h                      (s, h)             1
k                      (s, h)(h, k)       4
c                      (s, c)             10
d                      (s, c)(c, d)       12
e                      (s, e)             5
f                      (s, e)(e, f)       8

The map of the network can be viewed as an n × n array MAP(G), one row and one column per entity, where for any two entities x and y, the entry MAP[x, y] contains information on whether link (x, y) exists, and if so on its cost. In a sense, each entity x knows initially only its own row MAP[x, ·]. To know the entire map, every entity needs to know the initial information of all the other entities. This is a particular instance of a general problem called input collection or gossip: every entity has a (possibly different) piece of information; the goal is to reach a final configuration where every entity has all the pieces of information. The solution of the gossiping problem using normal messages is simple: every entity broadcasts its initial information. Since it relies solely on broadcast, this operation is more efficiently performed in a tree. Thus, the protocol will be as follows:

Map Gossip:
1. An arbitrary spanning tree of the network is created, if not already available; this tree will be used for all communication.
2. Each entity acquires full information about its neighborhood (e.g., names of the neighbors, cost of the incident links, etc.), if not already available.
3. Each entity broadcasts its neighborhood information along the tree.

At the end of the execution, each entity has a complete map of the network with all the link costs; it can then locally construct its shortest-path routing table. The construction of the initial spanning tree can be done using O(m + n log n) messages, for example using protocol MegaMerger. The acquisition of neighborhood information requires a single exchange of messages between neighbors, requiring in total just 2m messages. Each entity x then broadcasts on the tree deg(x) items of information, and each item traverses the n − 1 links of the tree. Hence the total number of messages will be at most

$\sum_x \deg(x)\,(n - 1) = 2m\,(n - 1)$.

Thus, we have

M[Map Gossip] = 2 m n + l.o.t.    (4.1)
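The broadcast step of Map Gossip and its message count can be illustrated with a small simulation. This is only a sketch under stated assumptions: the link costs in `EDGES` are read off Table 4.3 for the network of Figure 4.1, the spanning tree is an arbitrary BFS tree, and the names `neighbors`, `bfs_tree`, and `map_gossip` are hypothetical helpers, not part of the protocol as given in the text. It counts only the messages of step 3.

```python
from collections import deque

# Link costs assumed for the network of Figure 4.1 (read off Table 4.3).
EDGES = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def neighbors(edges):
    """Build the symmetric adjacency map: adj[x][y] = cost of link (x, y)."""
    adj = {}
    for (a, b), cost in edges.items():
        adj.setdefault(a, {})[b] = cost
        adj.setdefault(b, {})[a] = cost
    return adj

def bfs_tree(adj, root):
    """Return the links of an arbitrary (here: BFS) spanning tree."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        x = queue.popleft()
        for y in sorted(adj[x]):
            if y not in parent:
                parent[y] = x
                queue.append(y)
    tree = {x: set() for x in adj}
    for child, par in parent.items():
        if par is not None:
            tree[child].add(par)
            tree[par].add(child)
    return tree

def map_gossip(adj, root="s"):
    """Broadcast every neighborhood item on the tree; return (maps, messages)."""
    tree = bfs_tree(adj, root)
    maps = {x: {x: dict(adj[x])} for x in adj}  # each entity knows its own row
    messages = 0
    for origin in adj:
        for neighbor, cost in adj[origin].items():  # one item per incident link
            stack = [(origin, None)]
            while stack:  # forward the item along the tree, away from its origin
                node, came_from = stack.pop()
                maps[node].setdefault(origin, {})[neighbor] = cost
                for nxt in tree[node]:
                    if nxt != came_from:
                        messages += 1
                        stack.append((nxt, node))
    return maps, messages

ADJ = neighbors(EDGES)
MAPS, MESSAGES = map_gossip(ADJ)
print(MESSAGES)  # 2m(n - 1) with n = 7 entities and m = 9 links
```

With these assumed costs the simulation transmits 2m(n − 1) = 108 messages, and every entity ends up holding the same complete map.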


This means that, in sparse networks, all the routing tables can be constructed with at most $O(n^2)$ normal messages. Such is the case of meshes, tori, butterflies, and so forth. In systems that allow very long messages, not surprisingly the gossip problem, and thus the routing table construction problem, can be solved with substantially fewer messages (Exercises 4.6.3 and 4.6.4). The time costs of gossiping on a tree depend on many factors, including the diameter of the tree and the number of items an entity initially has (Exercise 4.6.2).

4.2.2 Iterative Construction of Routing Tables

The solution we have just seen requires that each entity has locally available enough storage to store the entire map of the network. If this is not the case, the problem of constructing the routing tables is more difficult to resolve. Several traditional sequential methods are based on an iterative approach. Initially, each entity x knows only its neighboring information: for each neighbor y, the entity knows the cost θ(x, y) of reaching it using the direct link (x, y). On the basis of this initial information, x can construct an approximation of its routing table. This imperfect table is usually called distance vector, and in it the cost for those destinations x knows nothing about will be set to ∞. For example, the initial distance vector for node s in the network of Figure 4.1 is shown in Table 4.2.

This approximation of the routing table will be refined, and eventually corrected, through a sequence of iterations. In each iteration, every entity communicates its current distance vector to all its neighbors. On the basis of the received information, each entity updates its current information, replacing paths in its own routing table if the neighbors have found better routes. How can an entity x determine if a route is better? The answer is very simple: when, in an iteration, x is told by a neighbor y that there exists a path π2 from y to z with cost g2, x checks in its current table the path π1 to z and its cost g1, as well as the cost θ(x, y). If θ(x, y) + g2 < g1, then going directly to y and then using π2 to reach z is less expensive than going to z through the path π1 currently in the table. Among several better choices, obviously x will select the best one.

TABLE 4.2: Initial Approximation of RT(s)

Routing Destination    Shortest Path    Cost
h                      (s, h)           1
k                      ?                ∞
c                      (s, c)           10
d                      ?                ∞
e                      (s, e)           5
f                      ?                ∞

TABLE 4.3: Initial Distance Vectors

       s     h     k     c     d     e     f
s      -     1     ∞    10     ∞     5     ∞
h      1     -     3     ∞     ∞     ∞     ∞
k      ∞     3     -     ∞     ∞     3     5
c     10     ∞     ∞     -     2     ∞     ∞
d      ∞     ∞     ∞     2     -     8     ∞
e      5     ∞     3     ∞     8     -     3
f      ∞     ∞     5     ∞     ∞     3     -

Specifically, let $V_y^i[z]$ denote the cost of the “best” path from y to z known to y in iteration i; this information is contained in the distance vector sent by y to all its neighbors at the beginning of iteration i + 1. After sending its own distance vector and upon receiving the distance vectors of all its neighbors, entity x computes

$w[z] = \min_{y \in N(x)} \left(\theta(x, y) + V_y^i[z]\right)$

for each destination z. If $w[z] < V_x^i[z]$, then the new cost and the corresponding path to z are chosen, replacing the current selection. That interaction with just the neighbors is sufficient follows from the fact that the cost $\gamma_a(b)$ of the shortest path from a to b has the following defining property:

Property 4.2.1

$\gamma_a(b) = \begin{cases} 0 & \text{if } a = b \\ \min_{w \in N(a)} \{\theta(a, w) + \gamma_w(b)\} & \text{otherwise.} \end{cases}$

The Protocol Iterated Construction based on this strategy converges to the correct information and will do so after at most n − 1 iterations (Exercise 4.6.8). For example, in the graph of Figure 4.1, the process converges to the correct routing tables after only two iterations; see Tables 4.3–4.5: for each entity, only the cost information for every destination is displayed. The main advantage of this process is that the amount of storage required at an entity is proportional to the size of the routing table and not to the map of the entire system.

TABLE 4.4: Distance Vectors After First Iteration

       s     h     k     c     d     e     f
s      -     1     4    10    12     5     8
h      1     -     3    11     ∞     6     8
k      4     3     -     ∞    11     3     5
c     10    11     ∞     -     2    10     ∞
d     12     ∞    11     2     -     8    11
e      5     6     3    10     8     -     3
f      8     8     5     ∞    11     3     -


TABLE 4.5: Distance Vectors After Second Iteration

       s     h     k     c     d     e     f
s      -     1     4    10    12     5     8
h      1     -     3    11    13     6     8
k      4     3     -    13    11     3     5
c     10    11    13     -     2    10    13
d     12    13    11     2     -     8    11
e      5     6     3    10     8     -     3
f      8     8     5    13    11     3     -
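The iterations tabulated above can be reproduced with a short synchronous simulation. This is a minimal sketch, assuming the link costs of Figure 4.1 are those listed in Table 4.3; the function names are hypothetical, only costs (not paths) are tracked, and the early stop when no vector changes is a convenience of the simulation, since no single entity could detect it locally.

```python
import math

# Link costs assumed for the network of Figure 4.1 (read off Table 4.3).
EDGES = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def adjacency(edges):
    """Build the symmetric adjacency map: adj[x][y] = theta(x, y)."""
    adj = {}
    for (a, b), cost in edges.items():
        adj.setdefault(a, {})[b] = cost
        adj.setdefault(b, {})[a] = cost
    return adj

def initial_vectors(adj):
    """Initial vectors: 0 for x itself, theta(x, z) for neighbors, infinity otherwise."""
    return {x: {z: 0 if z == x else adj[x].get(z, math.inf) for z in adj}
            for x in adj}

def iterate(adj, vectors):
    """One synchronous iteration: every x combines the vectors of all its neighbors."""
    new = {}
    for x in adj:
        new[x] = dict(vectors[x])
        for z in adj:
            if z == x:
                continue
            # w[z] = min over neighbors y of theta(x, y) + V_y[z]
            w = min(adj[x][y] + vectors[y][z] for y in adj[x])
            if w < vectors[x][z]:
                new[x][z] = w
    return new

def iterated_construction(adj):
    """Iterate at most n - 1 times; return the final vectors and the rounds that changed them."""
    vectors = initial_vectors(adj)
    rounds = 0
    for _ in range(len(adj) - 1):
        updated = iterate(adj, vectors)
        if updated == vectors:  # global check, used only to stop the simulation early
            break
        vectors, rounds = updated, rounds + 1
    return vectors, rounds

ADJ = adjacency(EDGES)
VECTORS, ROUNDS = iterated_construction(ADJ)
print(ROUNDS, VECTORS["s"])
```

Under these assumptions the vectors stop changing after two iterations, in agreement with Tables 4.4 and 4.5.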

Let us analyze the message and time costs of the associated protocol. In each iteration, an entity sends its distance vector containing costs and path information; actually, it is not necessary to send the entire path but only the first hop in it (see discussion in Section 4.4). In other words, in each iteration, an entity x needs to send n items of information to its deg(x) neighbors. Thus, in total, an iteration requires 2nm messages. As this process terminates after at most n − 1 iterations, we have

M[Iterated Construction] = 2 (n − 1) n m.    (4.2)

That is, this approach is more expensive than the one based on constructing all the maps; it does, however, require less local storage. As for the time complexity, let τ(n) denote the amount of ideal time required to transmit n items of information to the same neighbor; then

T[Iterated Construction] = (n − 1) τ(n).    (4.3)
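To make the comparison between (4.1) and (4.2) concrete, the two bounds can be evaluated for a small network. This is a sketch only: the function names are hypothetical, the values n = 7 and m = 9 are assumed from the network of Figure 4.1 as read off Table 4.3, and the Map Gossip figure counts just the broadcast term 2m(n − 1), ignoring the lower-order costs of building the tree and exchanging neighborhood information.

```python
def map_gossip_messages(n, m):
    """Broadcast term behind (4.1): each of the 2m items crosses n - 1 tree links."""
    return 2 * m * (n - 1)

def iterated_construction_messages(n, m):
    """Bound (4.2): n - 1 iterations, each sending n items over every link in both directions."""
    return 2 * (n - 1) * n * m

n, m = 7, 9  # assumed size of the network of Figure 4.1
print(map_gossip_messages(n, m), iterated_construction_messages(n, m))
```

For these values the worst-case bound of the iterative approach is larger by a factor of n, as the two formulas suggest.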

Clearly, if the system allows very long messages, the protocol can be executed with fewer messages. In particular, if messages containing O(n) items of information (instead of O(1)) are possible, then in each iteration an entity can transmit its entire distance vector to a neighbor with just one message and τ(n) = 1. The entire process can thus be accomplished with O(n m) messages and the time complexity would then be just n − 1.

4.2.3 Constructing Shortest-Path Spanning Tree

The first solution we have seen, protocol Map Gossip, requires that each entity has locally available enough storage to store the entire map of the network. The second solution, protocol Iterated Construction, avoids this problem, but it does so at the expense of a substantially increased number of messages. Our goal is to design a protocol that, without increasing the local storage requirements, constructs the routing tables with a smaller amount of communication. Fortunately, there is an important property that will help us in achieving this goal.


Consider the paths contained in the full routing table RT(s) of an entity s, for example, the ones in Table 4.1. These paths define a subgraph of the network (as not every link is included). This subgraph is special: It is connected, contains all the nodes, and does not have cycles (see Figure 4.1 where the subgraph links are in bold); in other words, it is a spanning tree! It is called the shortest path spanning tree rooted in s, PT(s), sometimes also known as the sink tree of s. This fact is important because it tells us that, to construct the routing table RT(s) of s, we just need to construct the shortest path spanning tree PT(s).

Protocol Design

To construct the shortest path spanning tree PT(s), we can adapt a classical serial strategy for constructing PT(s) starting from the source s:

Serial Strategy

We are given a connected fragment T of PT(s), containing s (initially, T will be composed of just s). Consider now all the links going outside of T (i.e., to nodes not yet in T). To each such link (x, y) associate the value v(x, y) = γs(x) + θ(x, y), that is, v(x, y) is the cost of reaching y from the source s by first going to x (through a shortest path) and then using the link (x, y) to reach y. Add to T the link (a, b) for which v(a, b) is minimum; in case of a tie, choose the one leading to the node with the lexicographically smallest name.

This strategy works because of the following property:

Property 4.2.2 Let T and (a, b) be as defined in the serial strategy. Then T ∪ (a, b) is a connected fragment of PT(s).

That is, the new tree, obtained by adding the chosen (a, b) to T, is also a connected fragment of PT(s), containing s, and it is clearly larger than T. In other words, using this strategy, the shortest path spanning tree PT(s) will be constructed, starting from s, by adding the appropriate links, one at a time. The algorithm based on this strategy will be a sequence of iterations started from the root. In each iteration, the outgoing link (a, b) with minimum cost v(a, b) is chosen; the link (a, b) and the node b are added to the fragment, and a new iteration is started. The process terminates when the fragment includes all the nodes.

Our goal is now to implement this algorithm efficiently in a distributed way. First of all, let us consider what a node y in the fragment T knows. Definitely y knows which of its links are part of the current fragment; it also knows the length γs(y) of the shortest path from the source s to it.
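The serial strategy described above can be sketched as a centralized computation. This is only an illustration under stated assumptions: the link costs in `EDGES` are those read off Table 4.3 for the network of Figure 4.1, the names `serial_strategy` and `path` are hypothetical, and the code assumes a connected network; it says nothing yet about how the iterations are carried out in a distributed way.

```python
# Link costs assumed for the network of Figure 4.1 (read off Table 4.3).
EDGES = {
    ("s", "h"): 1, ("s", "c"): 10, ("s", "e"): 5,
    ("h", "k"): 3, ("k", "e"): 3, ("k", "f"): 5,
    ("c", "d"): 2, ("d", "e"): 8, ("e", "f"): 3,
}

def adjacency(edges):
    """Build the symmetric adjacency map: adj[x][y] = theta(x, y)."""
    adj = {}
    for (a, b), cost in edges.items():
        adj.setdefault(a, {})[b] = cost
        adj.setdefault(b, {})[a] = cost
    return adj

def serial_strategy(adj, s):
    """Grow PT(s) one link at a time, adding the outgoing link (a, b) with minimum v(a, b)."""
    gamma = {s: 0}      # gamma_s(x) for the nodes already in the fragment T
    parent = {s: None}  # tree link through which each node entered T
    while len(gamma) < len(adj):
        candidates = [
            (gamma[x] + cost, y, x)  # (v(x, y), outside node y, fragment node x)
            for x in gamma
            for y, cost in adj[x].items()
            if y not in gamma        # only links leaving the fragment
        ]
        v, b, a = min(candidates)    # ties go to the lexicographically smallest b
        gamma[b] = v
        parent[b] = a
    return gamma, parent

def path(parent, z):
    """Return the links of the tree path from the root to z."""
    links = []
    while parent[z] is not None:
        links.append((parent[z], z))
        z = parent[z]
    return links[::-1]

ADJ = adjacency(EDGES)
GAMMA, PARENT = serial_strategy(ADJ, "s")
for z in "hkcdef":
    print(z, path(PARENT, z), GAMMA[z])
```

With the assumed costs, the printed paths and costs coincide with the entries of Table 4.1.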


MESSAGE ROUTING AND SHORTEST PATHS

IMPORTANT. Let us assume for the moment that y also knows which of its links are outgoing (i.e., lead to nodes outside of the current fragment) and which are internal. In this case, finding the outgoing link (a, b) with minimum cost v(a, b) is rather simple, and the entire iteration is composed of four easy steps:

Iteration
1. The root s broadcasts in T the start of the new iteration.
2. Upon receiving the start, each entity x in the current fragment T computes locally v(x, y) = γ_s(x) + θ(x, y) for each of its outgoing incident links (x, y); it then selects among them the link e = (x, y′) for which v(x, y′) is minimized.
3. The overall minimum v(a, b) among all the locally selected v(e)'s is computed at s, using a minimum-finding for (rooted) trees (e.g., see Section 2.6.7), and the corresponding link (a, b) is chosen as the one to be added to the fragment.
4. The root s notifies b of the selection; the link (a, b) is added to the spanning tree; b computes γ_s(b), and s is notified of the end of the iteration.

Each iteration can be performed efficiently, in O(n) messages, as each operation (broadcast, min-finding, notifications) is performed on a tree of at most n nodes. There are a couple of problems that need to be addressed. A small problem is how b can compute γ_s(b). This value is actually determined at s by the algorithm in this iteration; hence, s can communicate it to b when notifying it of its selection. A more difficult problem regards the knowledge of which links are outgoing (i.e., they lead to nodes outside of the current fragment); we have assumed that an entity in T has such knowledge about its links. But how can such knowledge be ensured? As described, during an iteration, messages are sent only on the links of T and on the link selected in that iteration. This means that the outgoing links are all unexplored (i.e., no message has been sent or received on them).
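Steps 2 and 3 of the iteration can be pictured with the following sketch, which replaces the message exchange on the tree with a recursive call per child. It assumes, as in the text at this point, that every node already knows its outgoing links; the dictionaries `DIST`, `CHILDREN`, and `OUTGOING` and both function names are hypothetical illustrations.

```python
INF = float("inf")

def local_minimum(x, dist, outgoing):
    """Step 2: best (value, target, source) among the outgoing links of x."""
    best = (INF, "", "")
    for y, cost in outgoing.get(x, {}).items():
        best = min(best, (dist[x] + cost, y, x))
    return best

def convergecast_min(x, children, dist, outgoing):
    """Step 3: minimum over the subtree rooted at x.

    Each recursive call stands for the MinValue message that a child
    would send to its parent in the distributed execution.
    """
    best = local_minimum(x, dist, outgoing)
    for child in children.get(x, []):
        best = min(best, convergecast_min(child, children, dist, outgoing))
    return best

# Hypothetical fragment: s is the root and a is its only child.
DIST = {"s": 0, "a": 1}
CHILDREN = {"s": ["a"], "a": []}
OUTGOING = {"s": {"b": 4}, "a": {"b": 2, "c": 6}}
CHOICE = convergecast_min("s", CHILDREN, DIST, OUTGOING)
```

Here the root obtains the triple (3, "b", "a"): the link (a, b) with value v(a, b) = 3 is the one to be added, and the value 3 is exactly the distance that s can communicate to b in step 4.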
As we do not know which are outgoing, an entity could perform the computation of step 2 for each of its unexplored incident links and select the minimum among those. Consider for example the graph of Figure 4.2(a) and assume that we have already constructed the fragment shown in Figure 4.2(b). There are four unexplored links incident to the fragment (shown as leading to square boxes), each with its value (shown in the corresponding square box); the link (s, e) among them has minimum value and is chosen; it is outgoing and it is added to the fragment. The new fragment is shown in Figure 4.2(c) together with the unexplored links incident on it. However, not all unexplored links are outgoing: An unexplored link might be internal (i.e., leading to a node already in the fragment), and selecting such a link would be an error. For example, in Figure 4.2(c), the unexplored link (e, k) has value v(e, k) = 7, which is minimum among the unexplored edges incident on the fragment, and hence would be chosen; however, node k is already in the fragment. We could allow for errors: We choose among the unexplored links and, if the link (in our example: (e, k)) selected by the root s in step 3 turns out to be internal

SHORTEST PATH ROUTING

[Figure 4.2, panels (a) to (d): the example graph with its link costs and the successive fragments; the drawing is not recoverable from the extracted text.]

FIGURE 4.2: Determining the next link to be added to the fragment.

(k would find out in step 4 when the notification arrives), we eliminate that link from consideration and select another one. The drawback of this approach is its overall cost. In fact, since initially all links are unexplored, we might have to perform the entire selection process for every link. This means that the cost will be O(nm), which in the worst case is O(n^3): a high price to construct a single routing table. A more efficient approach is to add a mechanism so that no error will occur. Fortunately, this can be achieved simply and efficiently as follows. When a node b becomes part of the tree, it sends a message to all its neighbors notifying them that it is now part of the tree. Upon receiving such a message, a neighbor c knows that its link to b must no longer be used when performing shortest path calculations for the tree. As a side effect, in our example, when the link (s, e) is chosen in Figure 4.2(b), node e already knows that the link (e, k) leads to a node already in the fragment; thus such a link is not considered, as shown in Figure 4.2(d).

RECALL. We have used a similar strategy with the protocol for depth-first traversal, to decrease its time complexity.

IMPORTANT. It is necessary for b to ensure that all its neighbors have received its message before a new iteration is started. Otherwise, due to time delays, a neighbor


c might receive the request to compute the minimum for the next iteration before the message from b has even arrived; thus, it is possible that c (not knowing yet that b is part of the tree) chooses its link to b as its minimum, and such a choice is selected as the overall minimum by the root s. In other words, it is still possible that an internal link is selected during an iteration. Summarizing, to avoid mistakes, it is sufficient to modify step 4 as follows:

4. The root s sends an Expand message to b and the link (a, b) is added to the spanning tree; b computes γ_s(b), sends a notification to its neighbors, waits for their acknowledgments, and then notifies s of the end of the iteration.

This ensures that there will be only n − 1 iterations, each adding a new node to the spanning tree, with a total cost of O(n^2) messages. Clearly we must also consider the cost of each node notifying its neighbors (and of their acknowledgments), but this adds only O(m) messages in total. The protocol, called PT Construction, is shown in Figures 4.3–4.6.

Analysis Let us now analyze the cost of protocol PT Construction in detail. There are two basic activities being performed: the expansion of the current fragment of the tree and the announcement (with acknowledgments) of the addition of the new node to the fragment. Let us consider the expansion first. It consists of a "start-up" (the root broadcasting the Start Iteration message), a "convergecast" (the minimum value is collected at the root using the MinValue messages), and two "notifications" (the root notifies the new node using the Expand message, and the new node notifies the root using the Iteration Completed message). Each of these operations is performed on the current fragment, which is a tree, rooted in the source.
In particular, the start-up and the convergecast operations each cost only one message on every link; in the notifications, messages are sent only on the links in the path from the source to the new node, and there will be only one message in each direction. Thus, in total, on each link of the tree constructed so far, there will be at most four messages due to the expansion; two messages will also be sent on the new link added in this expansion. Thus, in the expansion at iteration i, at most 4(n_i − 1) + 2 messages will be sent, where n_i is the size of the current tree. As the tree is expanded by one node at a time, n_i = i. In fact, initially there is only the source; then the fragment is composed of the source and a neighbor, and so on. Thus, the total number of messages due to the expansion is

\[ \sum_{i=1}^{n-1} \bigl(4(n_i - 1) + 2\bigr) = \sum_{i=1}^{n-1} (4i - 2) = 2n(n-1) - 2(n-1) = 2n^2 - 4n + 2. \]

The cost due to announcements and acknowledgments is simple to calculate: Each node will send a Notify message to all its neighbors when it becomes part of the tree


PROTOCOL PT Construction.

States: S = {INITIATOR, IDLE, AWAKE, ACTIVE, WAITING FOR ACK, COMPUTING, DONE};
S_INIT = {INITIATOR, IDLE};
S_TERM = {DONE}.
Restrictions: IR; UI.

INITIATOR
  Spontaneously
  begin
    source := true;
    my_distance := 0;
    ackcount := |N(x)|;
    send(Notify) to N(x);
  end

  Receiving(Ack)
  begin
    ackcount := ackcount − 1;
    if ackcount = 0 then
      iteration := 1;
      v(x, y) := MIN{v(x, z) : z ∈ N(x)};
      path_length := v(x, y);
      Children := {y};
      send(Expand, iteration, path_length) to y;
      Unvisited := N(x) − {y};
      become ACTIVE;
    endif
  end

IDLE
  Receiving(Notify)
  begin
    Unvisited := N(x) − {sender};
    send(Ack) to sender;
    become AWAKE;
  end

AWAKE
  Receiving(Expand, iteration*, path_value*)
  begin
    my_distance := path_value*;
    parent := sender;
    Children := ∅;
    if |N(x)| > 1 then
      send(Notify) to N(x) − {sender};
      ackcount := |N(x)| − 1;
      become WAITING FOR ACK;
    else
      send(Iteration_Completed) to parent;
      become ACTIVE;
    endif
  end

FIGURE 4.3: Protocol PT-Construction (I)


AWAKE
  Receiving(Notify)
  begin
    Unvisited := Unvisited − {sender};
    send(Ack) to sender;
  end

WAITING FOR ACK
  Receiving(Ack)
  begin
    ackcount := ackcount − 1;
    if ackcount = 0 then
      send(Iteration_Completed) to parent;
      become ACTIVE;
    endif
  end

ACTIVE
  Receiving(Iteration_Completed)
  begin
    if not(source) then
      send(Iteration_Completed) to parent;
    else
      iteration := iteration + 1;
      send(Start_Iteration, iteration) to Children;
      Compute_Local_Minimum;
      childcount := 0;
      become COMPUTING;
    endif
  end

  Receiving(Start_Iteration, iteration*)
  begin
    iteration := iteration*;
    Compute_Local_Minimum;
    if Children = ∅ then
      send(MinValue, minpath) to parent;
    else
      send(Start_Iteration, iteration) to Children;
      childcount := 0;
      become COMPUTING;
    endif
  end

FIGURE 4.4: Protocol PT-Construction (II)

and receives an Ack from each of them. Thus, the total number of messages due to the notifications is

\[ \sum_{x \in V} 2\,|N(x)| = 2 \sum_{x \in V} \deg(x) = 4m. \]

To complete the analysis, we need to consider the final broadcast of the Terminate message, which is performed on the constructed tree; this will add n − 1 messages to the total, yielding the following:

\[ M[\text{PT Construction}] \le 2n^2 + 4m - 3n + 1. \qquad (4.4) \]
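As a sanity check of the arithmetic, the three contributions (expansion, notifications with acknowledgments, final termination broadcast) can be added up numerically and compared with the closed form of (4.4). The function names below are hypothetical; the sketch only re-adds the quantities derived in the analysis and does not simulate the protocol.

```python
def pt_construction_messages(n, m):
    """Sum the three cost components from the analysis of PT Construction."""
    # Expansion: at most 4(n_i - 1) + 2 messages in iteration i, with n_i = i.
    expansion = sum(4 * (i - 1) + 2 for i in range(1, n))
    # Every node sends Notify to each neighbor and receives an Ack back.
    notifications = 4 * m
    # Final Terminate broadcast on the constructed tree.
    termination = n - 1
    return expansion + notifications + termination

def closed_form(n, m):
    """Right-hand side of bound (4.4)."""
    return 2 * n * n + 4 * m - 3 * n + 1

# Example: a network with 10 nodes and 20 links.
EXAMPLE_TOTAL = pt_construction_messages(10, 20)
```

For n = 10 and m = 20 both expressions give 251 messages.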

ACTIVE
  Receiving(Expand, iteration*, path_value*)
  begin
    send(Expand, iteration*, path_value*) to exit;
    if exit = mychoice then
      Children := Children ∪ {mychoice};
      Unvisited := Unvisited − {mychoice};
    endif
  end

  Receiving(Notify)
  begin
    Unvisited := Unvisited − {sender};
    send(Ack) to sender;
  end

  Receiving(Terminate)
  begin
    send(Terminate) to Children;
    become DONE;
  end

COMPUTING
  Receiving(MinValue, path_value*)
  begin
    if path_value* < minpath then
      minpath := path_value*;
      exit := sender;
    endif
    childcount := childcount + 1;
    if childcount = |Children| then
      if not(source) then
        send(MinValue, minpath) to parent;
        become ACTIVE;
      else
        Check_for_Termination;
      endif
    endif
  end

FIGURE 4.5: Protocol PT Construction (III)

By adding a little bookkeeping, the protocol can be used to construct the routing table RT(s) of the source (Exercise 4.6.13). Hence, we have a protocol that constructs the routing table of a node using O(n^2) messages. We will see later how more efficient solutions can be derived for the special case when all the links have the same cost (or, alternatively, there is no cost on the links). Note that we have made no assumptions other than that the costs are non-negative; in particular, we did not assume first in first out (FIFO) channels (i.e., message ordering).

4.2.4 Constructing All-Pairs Shortest Paths Protocol PT Construction allows us to construct the shortest path tree of a node, and thus to construct the routing table of that entity. To solve the original problem of constructing all the routing tables, also known as all-pairs shortest-paths construction,


Procedure Check_for_Termination
begin
  if minpath = inf then
    send(Terminate) to Children;
    become DONE;
  else
    send(Expand, iteration, minpath) to exit;
    become ACTIVE;
  endif
end

Procedure Compute_Local_Minimum
begin
  if Unvisited = ∅ then
    minpath := inf;
  else
    link_length := v(x, y) = MIN{v(x, z) : z ∈ Unvisited};
    minpath := my_distance + link_length;
    mychoice := exit := y;
  endif
end

FIGURE 4.6: Procedures used by protocol PT Construction

this process must be repeated for all nodes. The complexity of the resulting protocol PT All follows immediately from equation (4.4):

\[ M[\text{PT All}] \le 2n^3 - 3n^2 + 4(m-1)n. \qquad (4.5) \]

The costs of protocols Map Gossip, Iterative Construction, and PT All are shown in Figure 4.7. Definitely better than protocol Iterative Construction, protocol PT All matches the worst case cost of Map Gossip without requiring large amounts of local storage. Hence, it is an efficient solution. It is clear that some information computed when constructing PT(x) can be reused in the construction of PT(y). For example, the shortest path from x to y is just the reverse of the one from y to x (under the bidirectional links assumption we are using); hence, we just need to determine one of them. Even stronger is the so-called optimality principle:

Property 4.2.3 If a node x is in the shortest path π from a to b, then π is also a fragment of PT(x).

Hence, once a shortest path π has been computed for the shortest path tree of an entity, this path can be added to the shortest path tree of all the entities in the path. So, in the example of Figure 4.1, the path (s, e)(e, f) in PT(s) will also be a part of PT(e) and PT(f). However, to date, it is not clear how this fact can be used to derive a more efficient protocol for constructing all the routing tables.

Algorithm                Cost            Restrictions
Map Gossip               O(n m)          Ω(m) local storage
Iterative Construction   O(n^2 m)
PT All                   O(n^3)
SparserGossip            O(n^2 log n)

FIGURE 4.7: Constructing all shortest path routing tables.

Constructing a Sparser Subgraph Interestingly, the number of messages can be brought down from O(n^3) to O(n^2 log n) not by cleverly exploiting information but rather by cleverly constructing a spanning subgraph of the network, called a sparser, and then simulating the execution of Map Gossip on it. To understand this subgraph, we need some terminology. Given a subset V′ ⊆ V of the nodes, we call the eccentricity of x ∈ V′ in V′ its largest distance from the other nodes of V′, that is, r(x, V′) = max_{y∈V′} {d_G(x, y)}; then r(V′) = max_{x∈V′} {r(x, V′)} is called the radius of V′. The density of x ∈ V′ in V′ instead is the number of its neighbors that are in V′, that is, den(x, V′) = |N(x) ∩ V′|; the density of V′ is the sum of the densities of all its nodes: den(V′) = Σ_{x∈V′} den(x, V′). Given a collection A of subsets of the nodes, the radius r(A) of A will be just the largest among the radii of those subsets; the density den(A) will be just the sum of the densities of those subsets. An (a, b)-sparser is just a partition S of the set V of nodes into subsets such that its radius is r(S) = a and its density is den(S) = b. The basic idea is to first of all

1. construct a sparser S = {V_1, . . . , V_k};
2. elect a leader x_i in each of its sets V_i;
3. establish a path connecting the two leaders of each pair of neighboring subsets.

Then the execution of the protocol in G is simulated in the sparser. What this means is that

4. each leader executes the algorithm for each node in its subset;
5. whenever in the algorithm a message is sent from a node in V_i to a node in V_j, the message is sent by x_i to x_j.
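To make the definitions of radius and density concrete, the sketch below evaluates them for a given partition of a small graph. It assumes that d_G counts hops in the whole network G, that the graph is connected, and that the density of a node counts its neighbors inside its own subset, as in the prose definition; the helper names and the example path graph are hypothetical.

```python
from collections import deque

def hop_distances(adj, x):
    """Breadth-first distances d_G(x, .) measured in the whole graph G."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def radius(adj, subset):
    """r(V') = largest eccentricity r(x, V') over the nodes x of V'."""
    return max(max(hop_distances(adj, x)[y] for y in subset) for x in subset)

def density(adj, subset):
    """den(V') = sum over x in V' of the number of neighbors of x in V'."""
    members = set(subset)
    return sum(len(members.intersection(adj[x])) for x in subset)

def sparser_parameters(adj, partition):
    """Return (r(S), den(S)) for a partition S given as a list of subsets."""
    return (max(radius(adj, part) for part in partition),
            sum(density(adj, part) for part in partition))

# Hypothetical example: the path a - b - c - d.
PATH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
SPLIT = sparser_parameters(PATH, [{"a", "b"}, {"c", "d"}])
WHOLE = sparser_parameters(PATH, [{"a", "b", "c", "d"}])
```

Splitting the path into two halves gives radius 1 and density 4, whereas keeping all four nodes in a single subset gives radius 3 and density 6.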
An interesting consequence of (5) above is that the cost of a node u sending a message to all its neighbors, when simulated in the sparser, will depend on the number of subsets in which u has neighbors as well as on the distance between the corresponding leaders. This means that for the simulation to be efficient, the radius should be small, r(S) = O(log n), and the density at most linear, den(S) = O(n). Fortunately we have (Exercise 4.6.15):

Property 4.2.4 Any connected graph G of n nodes has a (log n, n)-sparser.

The existence of this good sparser is not enough; we must be able to construct it with a reasonable amount of messages. Fortunately, this is also possible (Exercise


4.6.16). When constructing it, there are several important details that must be taken care of; in particular, the paths between the centers must be uniquely determined. Once all of this is done, we must then define the set of rules (Exercise 4.6.17) to simulate protocol Map Gossip. At this point, the resulting protocol, called SparserGossip, yields the desired performance

\[ M[\text{SparserGossip}] = O(n^2 \log n). \qquad (4.6) \]

Using Long Messages In systems that allow very long messages, not surprisingly the problem can be solved with fewer messages. For example, if messages can contain O(n) items of information (instead of O(1)), all the shortest path trees can be constructed with just O(n^2) messages (Exercise 4.6.18). If messages can contain O(n^2) items, then any graph problem, including the construction of all shortest path trees, can be solved using O(n) messages once a leader has been elected (requiring at least O(m + n log n) normal messages). A summary of all these results is shown in Figure 4.7.

4.2.5 Min-Hop Routing Consider the case when all links have the same cost (or alternatively, there are no costs associated to the links), that is, θ(a, b) = θ for all (a, b) ∈ E. This case is special in several respects. In particular, observe that the shortest path from a to b will have cost γ_a(b) = θ d_G(a, b), where d_G(a, b) is the distance (in number of hops) of a from b in G; in other words, the cost of a path will depend solely on the number of hops (i.e., the number of links) in that path. Hence, the shortest path between two nodes will be the one with minimum hops. For these reasons, routing in this situation is called min-hop routing. An interesting consequence is that the shortest path spanning tree of a node coincides with its breadth-first spanning tree. In other words, a breadth-first spanning tree rooted in a node is the shortest path spanning tree of that node when all links have the same cost. Protocol PT Construction works for any choice of the costs, provided they are non-negative; so it constructs a breadth-first spanning tree if all the costs are the same. However, we can take advantage of the fact that all links have the same costs to obtain a more efficient protocol. Let us see how.

Breadth-First Spanning-Tree Construction Without any loss of generality, let us assume that θ = 1; thus, γ_s(a) = d_G(s, a).
We can use the same strategy of protocol PT Construction of starting from s and successively expanding the fragment; only, instead of choosing one link (and thus one node) at the time, we can choose several simultaneously: In the ﬁrst step, s chooses all the nodes at distance 1 (its neighbors); in the second step, s chooses simultaneously all the nodes at distance 2; in general, in step i, s chooses simultaneously all the nodes at distance i; notice that before step i, none of the nodes at distance i was a part of the


fragment. Clearly, the problem is to determine, in step i, which nodes are at distance i from s. Observe this very interesting property: All the neighbors of s are at distance 1 from s; all their neighbors (not at distance 1 from s) are at distance 2 from s; in general, Property 4.2.5 If a node is at distance i from s, then its neighbors are at distance either i − 1 or i or i + 1 from s. This means that once the nodes at distance i from s have been chosen (and become part of the fragment), we need to consider only their neighbors to determine which nodes are at distance i + 1. So the protocol, which we shall call BF, is rather simple. Initially, the root s sends a “start iteration 1” message to each neighbor indicating the ﬁrst iteration of the algorithm and considers them its children. Each recipient marks its distance as 1, marks the sender as its parent, and sends an acknowledgment back to the parent. The tree is now composed of the root s and its neighbors, which are all at distance 1 from s. In general, after iteration i all the nodes at distance up to i are part of the tree. Furthermore, each node at distance i knows which of its neighbors are at distance i − 1 (Exercise 4.6.19). In iteration i + 1, the root broadcasts on the current tree a “start iteration i + 1” message. Once this message reaches a node x at distance i, it sends a “explore i + 1” message to its neighbors that are not at distance i − 1 (recall, x knows which they are) and waits for a reply from each of them. These neighbors are either at distance i like x itself, or at i + 1; those at distance i are already in the tree and so do not need to be included. Those at distance i + 1 must be attached to the tree; however, each must be attached only once (otherwise we create a cycle and do not form a tree; see Figure 4.8). When a neighbor y receives the “Explore” message, the content of its reply will depend on whether or not y is already part of the tree. 
If y is not part of the tree, it now knows that it is at distance i + 1 from s; it then marks the sender as its parent, sends a positive acknowledgment to it, and becomes part of the tree. If y is part of the tree (even if it just happened in this iteration), it will reply with a negative acknowledgment. When x receives the reply from y, if the reply is positive, it will mark y as a child, otherwise, it will mark y as already in the tree. Once all the replies have been received, it participates in a convergecast notifying the root that the iteration has been completed. Cost Let us now examine the cost of protocol BF. Denote by ni the number of nodes at distance at most i from s. In each iteration, there are three operations involving communication: (1) the broadcast of “Start”on the tree constructed so far; (2) the sending of “Explore” messages sent by the nodes at distance i, and the corresponding replies; and (3) the convergecast to notify the root of the termination of the iteration. Consider ﬁrst the cost of operation (2), that is, the cost of the “Explore” messages and the corresponding replies. Consider a node x at distance i. As already mentioned, its neighbors are at distance either i − 1 or i or i + 1. The neighbors at distance i − 1


FIGURE 4.8: Protocol BF expands an entire level in each iteration.

sent an “Explore” message to x in stage i − 1, so x sent a reply to each of them. In stage i x sent an “Explore” message to all its other neighbors. Hence, in total, x sent just one message (either “Explore” or reply) to each of its neighbors. This means that in total, the number of “Explore” and “Reply” messages is

\[ \sum_{x \in V} |N(x)| = 2m. \]
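The count of 2m "Explore" and reply messages can be reproduced with a small level-by-level simulation of protocol BF. The sketch below is a synchronous, centralized rendering under two labeled assumptions: the start message of the first iteration is counted like an "Explore", and when two neighbors at the same distance send "Explore" to each other, each of these messages doubles as the reply to the other (this is what makes every node send exactly one message per incident link, as in the argument above). The function and variable names are hypothetical.

```python
def bf_tree(adj, s):
    """Build a breadth-first tree from s, one level per iteration.

    Returns (level, parent, messages), where messages counts only
    the Explore messages and their replies.
    """
    level = {s: 0}
    parent = {s: None}
    frontier = [s]
    messages = 0
    i = 0
    while frontier:
        next_frontier = []
        for x in frontier:
            for y in adj[x]:
                if level.get(y) == i - 1:
                    continue  # x knows that y is one level closer to s
                messages += 1  # Explore from x to y
                if y not in level:
                    level[y] = i + 1
                    parent[y] = x
                    next_frontier.append(y)
                    messages += 1  # positive reply from y
                elif level[y] == i + 1:
                    messages += 1  # negative reply: y joined in this iteration
                # level[y] == i: the Explore sent by y to x acts as the
                # reply (assumption stated in the lead-in), so nothing extra.
        frontier = next_frontier
        i += 1
    return level, parent, messages

# Hypothetical example with n = 5 nodes and m = 6 links.
ADJ = {
    "s": ["a", "b"],
    "a": ["s", "b", "c"],
    "b": ["s", "a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}
LEVEL, PARENT, MESSAGES = bf_tree(ADJ, "s")
```

With m = 6 links the simulation counts 12 messages, that is, 2m; node c is attached only once (to a), and the later "Explore" from b is answered negatively.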

We will consider now the overall cost of operations (1) and (3). In iteration i + 1, both broadcast and convergecast are performed on the tree constructed in iteration i, thus costing ni − 1 messages each, for a total of 2ni − 2 messages. Therefore, the total cost will be

2(n_i − 1), 1 ≤ i […]

[…] > 0, x sends Explore(j + 1, k − 1) to all its neighbors except its parent. If k = 0, then a positive reply Positive(j) is sent to the parent y.
2. Let j > level_x. In this case, this is not a shorter path to x; x replies with a negative acknowledgment Negative(j).

When x receives a reply from its neighbor z:
1. If the level of the reply is (level_x + 1) then:
   (a) if the reply is Negative(level_x + 1), then x considers z a non-child;
   (b) if the reply is Positive(level_x + 1), then x considers z a child;
   (c) if, with this message, x has now received a reply with level (level_x + 1) from all its neighbors except its parent, then it sends Positive(level_x) to its parent.
2. If the level of the reply is not (level_x + 1) then the message is discarded.

FIGURE 4.10: Exploration phase of BF Levels: x is not part of the current fragment


Correctness During the extension phase all the nodes at distance at most t + l from the root are indeed reached, as can be easily veriﬁed (Exercise 4.6.23). Thus, to prove the correctness of the protocol, we need just to prove that those nodes will be attached to the existing fragment at the proper level. We will prove this by induction on the levels. First of all, all the nodes at level t + 1 are neighbors of the sources and thus each will receive at least one Explore(t + 1, l) message; when this happens, regardless of whatever has happened before, each will set its level to t + 1; as this is the smallest level that they can ever receive, their level will not change during the rest of the iteration. Let it be true for the nodes up to level t + k, 1 ≤ k ≤ l − 1; we will show that it also holds for the nodes in level t + k + 1. Let π be the path of length t + k + 1 from s to x and let u be the neighbor of x in this path; by deﬁnition, u is at level t + k and, by inductive hypothesis, it has correctly set (levelu ) = t + k. When this happened, u sent a message Explore(t + k + 1, l − k − 1) to all its neighbors, except its parent. As x is clearly not u’s parent, it will eventually receive this message; when this happens, x will correctly set (levelx ) = t + k + 1. So we must show that the expansion phase will not terminate before x receives this message. Focus again on node u; it will not send a positive acknowledgment to its parent (and thus the phase can not terminate) until it receives a reply from all its other neighbors, including x. As, to reply, x must ﬁrst receive the message, x will correctly set its level during the phase. Cost To determine the cost of protocol BF Levels, we need to analyze the cost of the synchronization and of the expansion phases. The cost of a synchronization, as we discussed earlier, is at most 2(n − 1) messages, as both the initialization broadcast and the termination convergecast are performed on the currently available tree. 
Hence, the total cost of all synchronization activities depends on the number of iterations. This quantity is easily determined. As there are radius(r) < d(G) levels, and we add l levels in every iteration, except in the last where we add the rest, the number of iterations is at most d(G)/l. This means that the total amount of messages due to synchronization is at most

\[ 2(n-1)\,\frac{d(G)}{l} \le \frac{2(n-1)^2}{l}. \qquad (4.9) \]

Let us now analyze the cost of the expansion phase in iteration i, 1 ≤ i ≤ d(G)/l. Observe that in this phase, only the nodes in the levels L(i) = {(i − 1)l + 1, (i − 1)l + 2, . . . , il − 1, il} as well as the sources (i.e., the nodes at level (i − 1)l) will be involved, and messages will only be sent on the m_i links between them. The messages sent during this phase will be just Explore(t + 1, l), Explore(t + 2, l − 1), Explore(t + 3, l − 2), . . . , Explore(t + l, 0), and the corresponding replies will be Positive(j) or Negative(j), t + 1 ≤ j ≤ t + l. A node in one of the levels in L(i) sends to its neighbors at most one of each of those Explore messages; hence there will be on each edge at most 2l Explore messages (l in each direction), for a total of 2l m_i. As for each Explore there is at most one reply, the total number of messages sent in this phase will be no more than 4l m_i.


This fact, observing that the sets of links involved in different iterations are disjoint, yields at most

\[ \sum_{i=1}^{d(G)/l} 4\,l\,m_i = 4\,l\,m \qquad (4.10) \]

messages for all the explorations of all iterations. Combining equations (4.9) and (4.10), we obtain

\[ M[\text{BF Levels}] \le \frac{2(n-1)\,d(G)}{l} + 4\,l\,m. \qquad (4.11) \]

If we choose l = O(n/√m), expression (4.11) becomes

\[ M[\text{BF Levels}] = O(n\sqrt{m}). \]
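The effect of the parameter l on bound (4.11) can be explored numerically. The sketch below simply evaluates the right-hand side of (4.11); the function names and the sample values of n and m are hypothetical, and the diameter is replaced by its worst case d(G) = n − 1.

```python
import math

def bf_levels_bound(n, m, d, l):
    """Right-hand side of (4.11): synchronization plus exploration messages."""
    return 2 * (n - 1) * d / l + 4 * l * m

def suggested_l(n, m):
    """A strip width of the order of n / sqrt(m), never smaller than 1."""
    return max(1, round(n / math.sqrt(m)))

# Hypothetical network size: n = 1000 nodes, m = 10000 links.
N, M = 1000, 10000
L_CHOICE = suggested_l(N, M)
BOUND = bf_levels_bound(N, M, N - 1, L_CHOICE)
```

With these values l = 10, and the bound stays below 6 n √m = 600000, whereas the choice l = 1 would give more than two million messages in the worst case.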

This formula is quite interesting. In fact, it depends not only on n but also on the square root of the number m of links. If the network is sparse (i.e., it has O(n) links), then the protocol uses only O(n^1.5) messages; note that this occurs in any planar network. The worst case will be with very dense networks (i.e., m = O(n^2)). However, in this case the protocol will use at most O(n^2) messages, which is no more than protocol BF. In other words, protocol BF Levels will have the same cost as protocol BF only for very dense networks and will be much better in all other systems; in particular, whenever m = o(n^2), it uses a subquadratic number of messages. Let us consider now the ideal time costs of the protocol. Iteration i consists of reaching levels L(i) and returning to the root; hence the ideal time will be exactly 2il if 1 ≤ i < d(G)/l, and time 2d(G) in the last iteration. Thus, without considering the roundup, in total we have

\[ T[\text{BF Levels}] = \sum_{i=1}^{d(G)/l} 2\,l\,i = \frac{d(G)^2}{l} + d(G). \qquad (4.12) \]
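The closed form in (4.12) follows from the formula for the sum of the first k integers. Writing k = d(G)/l for the number of iterations (the symbol k is introduced here only for this derivation, and l is assumed to divide d(G), in line with ignoring the roundup):

```latex
\[
\sum_{i=1}^{k} 2\,l\,i
  \;=\; 2l \cdot \frac{k(k+1)}{2}
  \;=\; l\,k^{2} + l\,k
  \;=\; l \cdot \frac{d(G)^{2}}{l^{2}} + l \cdot \frac{d(G)}{l}
  \;=\; \frac{d(G)^{2}}{l} + d(G).
\]
```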

The choice l = O(n/√m) we considered when counting the messages will give T[BF Levels] = O(d(G)^2 √m / n),


TABLE 4.6: Summary: Costs of Constructing a Breadth-first Tree

Network   Algorithm   Messages     Time
General   BF          O(m + nd)    O(d^2)
General   BF Levels   O(n √m)      O(d^2 √m/n + d)
Planar    BF Levels   O(n^1.5)     O(d^2/√n + d)

which, again, is the same ideal time as protocol BF only for very dense networks, and less in all other systems.

Reducing Time with More Messages If time is of paramount importance, better results can be obtained at the cost of more messages. For example, if in protocol BF Levels we were to choose l = d(G), we would obtain an optimal time cost: T[BF Levels] = 2d(G).

IMPORTANT. We measure ideal time considering a synchronous execution where the communication delays are just one unit of time. In such an execution, when l = d(G), the number of messages will be exactly 2m + n − 1 (Exercise 4.6.25). In other words, in this synchronous execution, the protocol has optimal message costs. However, this is not the message complexity of the protocol, just the cost of that particular execution. To measure the message complexity we must consider all possible executions. Remember that to measure ideal time we consider only synchronous executions, while to measure message costs we must look at all possible executions, both synchronous and asynchronous (and choose the worst one).

The cost in messages choosing l = d(G) is given by expression (4.11), which becomes O(m d(G)). This quantity is reasonable only for networks of small degree. By the way, a priori knowledge of d(G) is not necessary to obtain these bounds (either time or messages; Exercise 4.6.24). If we are willing to settle for a low but suboptimal time, it is possible to achieve it with a better message complexity. Let us see how. In protocol BF Levels the network (and thus the tree) is viewed as divided into "strips," each containing l levels of the tree. See Figure 4.11. The way the protocol works right now, in the expansion phase, each source (i.e., each leaf of the existing tree) constructs its own bf-tree over the nodes in the next l levels. These bf-trees have differential growth rates, some growing quickly, some slowly.
Thus, it is possible for a quickly growing bf-tree to have processed many more levels than a slower bf-tree. Whenever there are conﬂicts due to transmission delays (e.g., the arrival of a message with a better level) or concurrency (e.g., the arrival of another message with the same level), these conﬂicts are resolved, either


FIGURE 4.11: We need more efﬁcient expansion of l levels in each iteration.

by "throwing away" everything already done and joining the new tree or sending a negative reply. It is the amount of work performed to take care of these conflicts that drives the costs of the protocol up. For example, when a node joins a bf-tree and has a (new) parent, it must send out messages to all its other neighbors; thus, if a node has a high degree and frequently changes trees, these adjacent edge messages dominate the communication complexity. Clearly, the problem is how to perform these operations efficiently. Conflicts and overlap occurring during the constructions of those different bf-trees in the l levels can be reduced by organizing the sources into clusters and coordinating the actions of the sources that are in the same cluster, as well as coordinating the different clusters. This in turn requires that the sources in the same cluster be connected so as to minimize the communication costs among them. The connection through a tree is the obvious option and is called a cover tree. To avoid conflicts, we want the cover trees of different clusters to have no edges in common. So we will have a forest of cover trees, which we will call the cover of all the sources. To coordinate the different clusters in the cover, we must be able to reach all sources; this, however, can already be done using the current fragment (recall, the sources are the leaves of the fragment). The message costs of the expansion phase will grow with the number of different clusters competing for the same node (the so-called load factor); on the contrary, the time costs will grow with the depth of the cover trees (the so-called depth factor). Notice that it is possible to obtain tradeoffs between the load factor and the depth factor by varying the size of the cover (i.e., the number of trees in the forest); for example, increasing the size of the forest reduces the depth factor while increasing the load factor.
We are thus faced with the problem of constructing clusters with a small amount of competition and shallow cover trees. Achieving this goal yields a time cost of O(d^{1+ε}) and a message cost of O(m^{1+ε}) for any fixed ε > 0. See Exercise 4.6.26.


MESSAGE ROUTING AND SHORTEST PATHS

4.2.6 Suboptimal Solutions: Routing Trees

Up to now, we have considered only shortest-path routing, that is, we have been looking at systems that always route a message to its destination through the shortest path. We will call such mechanisms optimal. To construct optimal routing mechanisms, we had to construct n shortest path trees, one for each node in the network, a task that we have seen is quite communication expensive.

In some cases, the shortest path requirement is important but not crucial; actually, in many systems, guarantee of delivery with few communication activities is the only requirement. If the shortest path requirement is relaxed or even dropped, the problem of constructing a routing mechanism (tables and forwarding scheme) becomes simpler and can be solved quite efficiently. Because they do not guarantee shortest paths, such solutions are called suboptimal. Clearly there are many possibilities depending on what (suboptimal) requirements the routing mechanism must satisfy.

A particular class of solutions is the one using a single spanning tree of the network for all the routing, which we shall call a routing tree. The advantages of such an approach are obvious: We need to construct just one tree. Delivery is guaranteed and no more than diam(T) messages will be used on the tree T. Depending on which tree is used, we have different solutions. Let us examine a few.

Center-Based Routing. As the maximum number of messages used to deliver a message is at most diam(T), a natural choice for a routing tree is a spanning tree with a small diameter. One such tree is the shortest path tree rooted in a center of the network. In fact, let c be a center of G (i.e., a node whose maximum distance to the other nodes is minimized) and let PT(c) be the shortest path tree of c. Then (Exercise 4.6.27), diam(G) ≤ diam(PT(c)) ≤ 2 diam(G). To construct such a tree, we need first of all to determine a center c and then construct PT(c), for example, using protocol PT Construction.
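The bound on the diameter of the center-based tree can be checked on a small instance. The following is only a centralized sketch, not the distributed protocol PT Construction: it assumes unit link costs (so that breadth-first search yields shortest paths), and the six-node ring and all function names are illustrative choices of ours.

```python
from collections import deque

def bfs(adj, src):
    """Return (dist, parent) maps of a breadth-first search from src."""
    dist, parent = {src: 0}, {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                parent[v] = u
                queue.append(v)
    return dist, parent

def eccentricity(adj, u):
    """Maximum distance from u to any other node."""
    return max(bfs(adj, u)[0].values())

def diameter(adj):
    return max(eccentricity(adj, u) for u in adj)

def center(adj):
    """A node whose maximum distance to the other nodes is minimum."""
    return min(adj, key=lambda u: eccentricity(adj, u))

def shortest_path_tree(adj, root):
    """Adjacency lists of PT(root) under unit link costs."""
    _, parent = bfs(adj, root)
    tree = {u: [] for u in adj}
    for v, p in parent.items():
        if p is not None:
            tree[v].append(p)
            tree[p].append(v)
    return tree

# Illustrative network (an assumption of this sketch): a ring of six nodes.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
c = center(ring)
pt_c = shortest_path_tree(ring, c)
print(diameter(ring), diameter(pt_c))
```

On this ring every node is a center, and the tree obtained is a path on six nodes; its diameter lies strictly between diam(G) and 2 diam(G), so neither inequality is tight here.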
Median-Based Routing. Once we choose a tree T, an edge e = (x, y) of T linking the subtree T[x − y] to the subtree T[y − x] will be used every time a node in T[x − y] wants to send a message to a node in T[y − x], and vice versa (see Figure 4.12), where each use costs θ(e). Thus, assuming that overall every node generates the same amount of messages for every other node and all nodes overall generate the same amount of messages, the cost of using T for routing all this traffic is

\[ \mathrm{Traffic}(T) = \sum_{(x,y) \in T} |T[x-y]| \; |T[y-x]| \; \theta(x,y). \]

It is not difﬁcult to see that such a measure is exactly the sum of all distances between nodes (Exercise 4.6.28). Hence, the best tree T to use is one that

FIGURE 4.12: The message traffic between the two subtrees T[x − y] and T[y − x] passes through edge e = (x, y).

minimizes the sum of all distances between nodes. Unfortunately, constructing the minimum-sum-distance spanning tree of a network is not simple; in fact, the problem is NP-hard. Fortunately, it is not difficult to construct a near-optimal solution. In fact, let z be a median of the network (i.e., a node for which the sum of distances SumDist(z) = \sum_{v \in V} d_G(v, z) to all other nodes is minimized) and let PT(z) be the shortest path tree of z. If T is the spanning tree that minimizes traffic, then (Exercise 4.6.29) Traffic(PT(z)) ≤ 2 Traffic(T). Thus, to construct such a tree, we need first of all to determine a median z and then construct PT(z), for example, using protocol PT Construction.

Minimum-Cost Spanning-Tree Routing. A natural choice for a routing tree is a minimum-cost spanning tree (MST) of the network. The construction of such a tree can be done, for example, using protocol MegaMerger discussed in Chapter 3.

All the solutions above have different advantages; for example, the center-based one offers the best worst-case cost, while the median-based one has the best average cost. Depending on the nature of the systems and of the applications, each might be preferable to the others.

There are also other measures that can be used to evaluate a routing tree. For example, a common measure is the so-called stretch factor σ_G(T) of a spanning tree T of G, defined as

\[ \sigma_G(T) = \max_{x,y \in V} \frac{d_T(x,y)}{d_G(x,y)}. \]    (4.13)

In other words, if a spanning tree T has a stretch factor α, then for each pair of nodes x and y, the cost of the path from x to y in T is at most α times the cost of the shortest path between x and y in G. A design goal could thus be to determine spanning trees with small stretch factors (see Exercises 4.6.30 and 4.6.31). These ratios are sometimes difﬁcult to calculate. Alternate, easier to compute, measures are obtained by taking into account only pairs of neighbors (instead of pairs of arbitrary nodes). One such measure is the

so-called dilation, that is, the length of the longest path in the spanning tree T corresponding to an edge of G, defined as

\[ \mathrm{dilation}_G(T) = \max_{(x,y) \in E} d_T(x,y). \]    (4.14)

We can also define the edge-stretch factor (or dilation factor) of a spanning tree T of G as the quantity

\[ \max_{(x,y) \in E} \frac{d_T(x,y)}{\theta(x,y)}. \]    (4.15)

As an example, consider the spanning tree PT(c) used in the center-based solution; if all the link costs are the same, we have that for every two nodes x and y

\[ 1 \le d_G(x,y) \le d_{PT(c)}(x,y) \le \mathrm{diam}(PT(c)) \le 2\,\mathrm{diam}(G). \]

This means that in PT(c) the (unweighted) stretch factor σ_G(T), the dilation dilation_G(T), and the edge-stretch factor are all bounded by the same quantity, twice the diameter of G.

For a given spanning tree T, the stretch factor and the dilation factor measure the worst ratio between the distance in T and in G for the same pair of nodes and the same edge, respectively. Another important cost measure is the average stretch factor, describing the average ratio:

\[ \bar{\sigma}_G(T) = \operatorname{Average}_{x,y \in V} \frac{d_T(x,y)}{d_G(x,y)} \]    (4.16)

and the average edge-stretch factor (or average dilation factor) of a spanning tree T of G, defined as

\[ \operatorname{Average}_{(x,y) \in E} \frac{d_T(x,y)}{\theta(x,y)}. \]    (4.17)

Construction of spanning trees with low average edge-stretch can be done effectively (Exercises 4.6.35 and 4.6.36). Summarizing, the main disadvantage of using a routing tree for all routing tasks is the fact that the routing path offered by such mechanisms is not optimal. If this is not a problem, these solutions are clearly a useful and viable alternative to shortest path routing. The choice of which spanning tree, among the many, should be used depends on the nature of the system and of the application. Natural choices include the ones described above, as well as those minimizing some of the cost measures we have introduced (see Exercises 4.6.31, 4.6.32, 4.6.33).
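The cost measures of this section are easy to evaluate on a concrete routing tree. The sketch below is a centralized illustration under assumptions of ours: all link costs are 1, the network is a ring of six nodes, and the routing tree is the path obtained by dropping one ring link; the function names are hypothetical. It evaluates Traffic(T) with the edge-by-edge formula and compares it with the sum of all tree distances, and it computes the stretch factor (4.13) and the dilation (4.14).

```python
from collections import deque
from itertools import combinations

def distances(adj, src):
    """Hop distances from src (all link costs are assumed to be 1)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def side_size(tree, x, y):
    """Number of nodes of T[x - y]: the side of x once tree edge (x, y) is removed."""
    seen, stack = {x}, [x]
    while stack:
        u = stack.pop()
        for v in tree[u]:
            if v not in seen and (u, v) != (x, y):
                seen.add(v)
                stack.append(v)
    return len(seen)

def traffic(tree):
    """Sum over the tree edges (x, y) of |T[x - y]| * |T[y - x]|, with unit costs."""
    return sum(side_size(tree, x, y) * side_size(tree, y, x)
               for x in tree for y in tree[x] if x < y)

def stretch_factor(graph, tree):
    """Largest ratio d_T(x, y) / d_G(x, y) over all pairs of nodes."""
    d_g = {u: distances(graph, u) for u in graph}
    d_t = {u: distances(tree, u) for u in tree}
    return max(d_t[x][y] / d_g[x][y] for x, y in combinations(graph, 2))

def dilation(graph, tree):
    """Largest tree distance between the endpoints of an edge of G."""
    d_t = {u: distances(tree, u) for u in tree}
    return max(d_t[x][y] for x in graph for y in graph[x])

n = 6
graph = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}                # ring
tree = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}  # path 0-1-...-5
sum_of_distances = sum(distances(tree, x)[y] for x, y in combinations(tree, 2))
print(traffic(tree), sum_of_distances, stretch_factor(graph, tree), dilation(graph, tree))
```

In this instance the two ways of computing the traffic agree, and the worst pair for both stretch and dilation is the pair of endpoints of the dropped link, which are neighbors in the ring but five hops apart in the tree.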


4.3 COPING WITH CHANGES

In some systems, the costs associated with the links might change over time; think, for example, of having a tariff (i.e., cost) for using a link during weekdays different from the one charged on the weekend. If such a change occurs, the shortest path between several pairs of nodes might change, rendering the information stored in the tables obsolete and possibly incorrect. Thus, the routing tables need to be adjusted. In this section, we will consider the problem of dealing with such events. We will assume that when the cost of a link (x, y) changes, both x and y are aware of the change and of the new cost of the link. In other words, we will replace the Total Reliability restriction with Total Component Reliability (thus, the only changes are in the costs) in addition to the Cost Change Detection restriction.

Note that costs that change in time can also describe the occurrence of some link failures in the system: The crash failure of an edge can be described by having its cost become exceedingly large. Hence, in the following, we will talk of link crash failures and of cost changes as the same type of event.

4.3.1 Adaptive Routing

In these dynamic networks where costs change in time, the construction of the routing tables is only the first step for ensuring (shortest path) routing: There must be a mechanism to deal with the changes in the network status, adjusting the routing tables accordingly.

Map Update

A simple, albeit expensive, solution is the Map Update protocol. It requires first of all that each table contains the complete map of the entire network; the next “hop” for a message to reach its destination is computed on the basis of this map. The construction of the maps can be done, for example, using protocol Map Gossip discussed in Section 4.2.1. Clearly, any change will render the map inaccurate.
Thus, an integral part of this protocol is the update mechanism:

Maintenance
1. as soon as an entity x detects a local change (either in the cost or in the status of an incident link), x will update its map accordingly and inform all its neighbors of the change through an “update” message;
2. as soon as an entity y receives an “update” from a neighbor, it will update its map accordingly and inform all its neighbors of the change through an “update” message.

NOTE. In several existing systems, an even more expensive periodic maintenance mechanism is used: Step 1 of the maintenance mechanism is replaced by having each node, periodically and even if there are no detected changes, send its entire map to all its neighbors. This is, for example, the case with the second Internet routing protocol:


The complete map is sent to all neighbors every 10–60 s (10 s if there is a cost change; 60 s otherwise).

The great advantage of this approach is that it is fully adaptive and can cope with any amount and type of changes. The clear disadvantage is the amount of information required locally and the volume of transmitted information.

Vector Update

To alleviate some of the disadvantages of the Map Update protocol, an alternative solution consists in using protocol Iterative Construction, which we designed to construct the routing tables, to keep them up-to-date should faults or changes occur. Every entity will just keep its routing table.

Note that a single change might make all the routing tables incorrect. To complicate things, changes are detected only locally, where they occur, and without a full map it might be impossible to detect whether a change has any impact on a remote site; furthermore, if several changes occur concurrently, their cumulative effect is unpredictable: A change might “undo” the damage inflicted to the routing tables by another change.

Whenever an entity x detects a local change (either in the cost or in the status of an incident link), the update mechanism is invoked, which will trigger an execution of possibly several iterations of protocol Iterative Construction. In regard to the update mechanism, we have two possible choices:

1. recompute the routing tables: everybody starts a new execution of the algorithm, throwing away the current tables, or
2. update current information: everybody starts a new iteration of the algorithm with x using the new data, continuing until the tables converge.

The first choice is very costly because, as we know, the construction of the routing tables is an expensive process. For this reason, one might want to recompute only what is necessary and only when it is necessary; hence the second choice is preferred. The second choice was used as the original Internet routing protocol; unfortunately, it has some problems.
A well-known problem is the so-called count-to-infinity problem. Consider the simple network shown in Figure 4.13. Initially all links have cost 1. Then the cost of link (z, w) becomes a large integer K >> 1. Both nodes z and w will then start an iteration that will be performed by all entities. During this iteration, z is told by y that there is a path from y to w of cost 2; hence, at the end of the iteration, z sets its distance to w to 3. In the next iteration, y sets its distance from w to 4 because the best path to w (according to the vectors it receives from x and z) is through x. In general, after the (2i + 1)th iteration, x and z will set their cost for reaching w to 2(i + 1) + 1, while y will set it to 2(i + 1). This process will continue until z sets its cost for w

FIGURE 4.13: The count-to-infinity problem. (Path x, y, z, w; links (x, y) and (y, z) have cost 1, and the cost of link (z, w) changes from 1 to K.)

to the actual value K. As K can be arbitrarily large, the number of iterations can be arbitrarily large. Solving this problem is not easy. See Exercises 4.6.38 and 4.6.39.

Oscillation

We have seen some approaches to maintain routing information in spite of failures and changes in the system. A problem common to all the approaches is called oscillation. It occurs if the cost of a link is proportional to the amount of traffic on the link. Consider, for example, two disjoint paths π1 and π2 between x and y, where initially π1 is the “best” path. Thus, the traffic is initially sent to π1; this will have the effect of increasing its cost until π2 becomes the best path. At this point the traffic will be diverted to π2, increasing its cost, and so forth. This oscillation between the two paths will continue forever, requiring continuous execution of the update mechanism.

4.3.2 Fault-Tolerant Tables

To continue to deliver a message through a shortest path to its destination in the presence of cost changes or link crash failures, an entity must have up-to-date information on the status of the system (e.g., which links are up, their current cost, etc.). As we have seen, keeping the routing tables correct when the topology of the network or the edge values may change is a very costly operation. This is true even if faults are very limited. Consider, for example, a system where at any time there is at most one link down (not necessarily the same one at all times), and no other changes will ever occur in the system; this situation is called single link crash failure (SLF). Even in this restricted case, the amount of information that must be kept in addition to the shortest paths is formidable (practically the entire map). This is because the crash failure of a single edge can dramatically change all the shortest path information.
As the tables must be able to cope with every possible choice of the failed link, even in such a limited case, the memory requirements soon become unfeasible. Furthermore, when a link fails, every node must be notified so that it can route messages along the new shortest paths; the subsequent recovery of that link will also require such a notification. Such a notification process needs to be repeated at each crash failure and recovery, for the entire lifetime of the system. Hence, the amount of communication is rather high and never ending as long as there are changes.

Summarizing, the service of delivering a message through a shortest path in the presence of cost changes or link crash failures, called shortest path rerouting (SR), is expensive (sometimes to the point of being unfeasible) both in terms of storage and communication. The natural question is whether there exists a less expensive alternative. Fortunately, the answer is positive. In fact, if we relax the shortest path rerouting requirement and settle for lower quality services, then the situation changes drastically; for example, as we will see, if the requirement is just message delivery (i.e., not necessarily through a shortest path), this service can be achieved in our SLF system with very simple routing tables and without any maintenance mechanism.
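To make the communication cost of adaptive maintenance concrete, the count-to-infinity behavior of Figure 4.13 can be reproduced with a few lines of simulation. This is only a sketch under assumptions of ours: the iterations are synchronous, only the distances to w are tracked, each node recomputes its distance from the values its neighbors held in the previous iteration, and the function name and the sample values of K are illustrative.

```python
def count_to_infinity(K):
    """Distances to w after each iteration, once the cost of (z, w) has become K."""
    cost = {("x", "y"): 1, ("y", "z"): 1, ("z", "w"): K}
    neighbors = {}
    for (u, v), c in cost.items():
        neighbors.setdefault(u, {})[v] = c
        neighbors.setdefault(v, {})[u] = c

    # Distances to w computed before the change, when all links had cost 1.
    dist = {"x": 3, "y": 2, "z": 1, "w": 0}
    history = [dict(dist)]
    while True:
        new = {"w": 0}
        for u in ("x", "y", "z"):
            # Each node trusts the distances announced by its neighbors
            # in the previous iteration.
            new[u] = min(c + dist[v] for v, c in neighbors[u].items())
        if new == dist:          # tables have converged
            return history
        dist = new
        history.append(dict(dist))

history = count_to_infinity(10)
for i, d in enumerate(history):
    print(i, d["x"], d["y"], d["z"])
```

With K = 10 the distances creep upward by two every two iterations until z finally adopts the direct link of cost K; rerunning with a larger K lengthens the history accordingly, which is the point of the example.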


In the rest of this section, we will concentrate on the single-link crash failure case.

Point-of-Failure Rerouting

To reduce the amount of communication and of storage, a simple and convenient alternative is to offer, after the crash failure of an arbitrary single link, a lower quality service called point-of-failure rerouting (PR):

Point-of-failure (Shortest path) Rerouting:
1. if the shortest path is not affected by the failed link, then the message will be delivered through that path;
2. otherwise, when the message reaches the node where the crash failure has occurred (the “point of failure”), the message will then be rerouted through a (shortest) path to its destination if no other failure occurs.

This type of service clearly has the advantage that there is no need to notify the entities of a link crash failure and its subsequent reactivation (if any): The message is forwarded as if there were no crash failures and if, by chance, the next link it must take has failed, it will just then be provided with an alternative route. This means that once constructed with the appropriate information for rerouting, the routing tables do not need to be maintained or updated. For this reason, the routing tables supporting such a service are called fault-tolerant tables.

The amount of information that a fault-tolerant table must contain (in addition to the shortest paths) to provide such a service will depend on what type of information is being kept at the nodes to do the rerouting and on whether or not the rerouting is guaranteed to be through a shortest path.

A solution consists in every node x knowing two (or more) edge-disjoint paths for each destination: the shortest path, and a secondary one to be used only if the link to the next “hop” in the shortest path has failed. So the routing mechanism is simple: When a message for destination r arrives at x, x determines the neighbor y in the shortest path to r.
If (x,y) is up, x will send the message to y, otherwise, it will determine the neighbor z in the secondary path to r and forward the message to z. The storage requirements of this solution are minimal: For each destination, a node needs to store in its routing table only one link in addition to the one in the fault-free shortest path. As we already know how to determine the shortest path trees, the problem is reduced to the one of computing the secondary paths (see Exercise 4.6.37). NOTE. The secondary paths of a node do not necessarily form a tree. A major drawback of this solution is that rerouting is not through a shortest path: If the crash failure occurs, the system does not provide any service other than message delivery. Although acceptable in some contexts, this level of service might not be


tolerable in general. Surprisingly, it is actually possible to offer shortest path rerouting storing at each node only one link for each destination in addition to the one in the fault-free shortest path. We are now going to see how to design such a service.

Point-of-Failure Shortest Path Rerouting

Consider a message originated by x and whose destination is s; its routing in the system will be according to the information contained in the shortest path spanning tree PT(s). The tree PT(s) is rooted in s; so every node x ≠ s has a parent p_s(x), and every edge in PT(s) links a node to its parent. When the link e_s[x] = (p_s(x), x) fails, it disconnects the tree into two subtrees, one containing s and the other x; call them T[s − x] and T[x − s]; see Figure 4.14.

When e_s[x] fails, a new path from x to s must be found. It cannot be just any path: It must be the shortest path possible between x and s in the network without e_s[x]. Consider a link e = (u, v) ∈ G \ PT(s), not part of the tree, that can reconnect the two subtrees created by the crash failure of e_s[x], that is, u ∈ T[s − x] and v ∈ T[x − s]. We will call such a link a swap edge for e_s[x]. Using e we can create a new path from x to s. The path will consist of three parts: the path from x to v in T[x − s], the edge (u, v), and the path from u to s; see Figure 4.15. The cost of going from x to s using this path will then be

\[ d_{PT(s)}(s,u) + \theta(u,v) + d_{PT(s)}(v,x) = d(s,u) + \theta(u,v) + d(v,x). \]

This is the cost of using e as a swap for e_s[x]. For each e_s[x] there are several edges that can be used as swaps, each with a different cost. If we want to offer shortest path rerouting from x to s when e_s[x] fails, we must use the optimal swap, that is, the swap edge for e_s[x] of minimum cost.
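The definition of optimal swap can be illustrated by a brute-force, centralized computation; this is not the efficient distributed solution of Exercises 4.6.40 and 4.6.41, only a sketch. The five-link weighted network, the node names, and the function names below are assumptions made for the example. The code builds PT(s) with Dijkstra's algorithm, determines T[x − s] as x together with its descendants, and evaluates the cost d(s, u) + θ(u, v) + d(v, x) of every swap edge.

```python
import heapq

def shortest_path_tree(graph, s):
    """Dijkstra from s; returns (dist, parent) describing PT(s)."""
    dist, parent = {s: 0}, {s: None}
    heap, done = [(0, s)], set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in graph[u].items():
            if v not in dist or d + w < dist[v]:
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, parent

def subtree(parent, x):
    """Nodes of T[x - s]: x and all its descendants in PT(s)."""
    nodes = set()
    for v in parent:
        u = v
        while u is not None and u != x:
            u = parent[u]
        if u == x:
            nodes.add(v)
    return nodes

def optimal_swap(graph, s, x):
    """Return (cost, (u, v)) of the cheapest swap edge for e_s[x], or None."""
    dist, parent = shortest_path_tree(graph, s)
    below = subtree(parent, x)
    best = None
    for v in below:
        for u, w in graph[v].items():
            # Skip links inside T[x - s] and the failed link itself.
            if u in below or (u, v) == (parent[x], x):
                continue
            # v is a descendant of x in PT(s), so d(v, x) = d(s, v) - d(s, x).
            cost = dist[u] + w + (dist[v] - dist[x])
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best

# Illustrative weighted network (an assumption of this sketch).
graph = {
    "s": {"a": 1, "b": 5},
    "a": {"s": 1, "x": 1, "b": 3},
    "x": {"a": 1, "b": 1},
    "b": {"x": 1, "s": 5, "a": 3},
}
print(optimal_swap(graph, "s", "x"))
```

Here PT(s) is the path s, a, x, b, so e_s[x] = (a, x) and T[x − s] = {x, b}. The two swap edges are (s, b), of cost 0 + 5 + 1 = 6, and (a, b), of cost 1 + 3 + 1 = 5; the latter is the optimal swap.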

FIGURE 4.14: The crash failure of e_s[x] = (p_s(x), x) disconnects the tree PT(s) into the subtrees T[s − x] and T[x − s].

FIGURE 4.15: Point-of-failure rerouting using the swap edge e = (u, v) of e_s[x].

So the first task that must be solved is how to find the optimal swap for each edge e_s[x] in PT(s). This computation can be done efficiently (Exercises 4.6.40 and 4.6.41); its result is that every node x knows the optimal swap edge for its incident link e_s[x]. To be used to construct the routing tables, this process must be repeated n times, once for each destination s (i.e., for each shortest path spanning tree PT(s)).

Once the information about the optimal swap edges has been determined, it needs to be integrated in the routing tables so as to provide point-of-failure shortest path rerouting. The routing table of a node x must contain information about (1) the shortest paths as well as about (2) the alternative paths using the optimal swaps:

1. Shortest path information. First and foremost, the routing table of x contains for each destination s the link to the neighbor in the shortest path to s if there are no failures. Denote by p_s(x) this neighbor. The choice of symbol is not accidental: This neighbor is the parent of x in PT(s) and the link is really e_s[x] = (p_s(x), x).
2. Alternative path information. In the entry for the destination s, the routing table of x must also contain the information needed to reroute the message if e_s[x] = (p_s(x), x) is down. Let us see what this information is. Let e = (u, v) be the optimal swap edge that x has computed for e_s[x]; this means that the shortest path from x to s if e_s[x] fails is by first going from x to v, then over the link (u, v), and finally from u to s. In other words, if e_s[x] fails, x must reroute the message for s to v, that is, x must send it to its neighbor in the shortest path to v. The shortest paths to v are described by the tree PT(v); in fact, this neighbor is just p_v(x) and the link over which the message to s must be sent when rerouting is precisely e_v[x] = (p_v(x), x) (see Exercise 4.6.42).
Concluding, the additional information x must keep in the entry for destination s is the rerouting link e_v[x] = (p_v(x), x) and the closest node v on the optimal swap edge for e_s[x]; this information will be used only if e_s[x] is down.
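As a small illustration, the entry kept by x for a destination s can be pictured as a record with one field per piece of information. The field names, the sample node names, and the helper function below are hypothetical; the sample values assume a node x whose parent in PT(s) is a and whose optimal swap edge for e_s[x] is (a, b), with b a neighbor of x.

```python
from dataclasses import dataclass
from typing import Tuple

Link = Tuple[str, str]

@dataclass(frozen=True)
class Entry:
    """Entry of the routing table of x for one final destination."""
    final_destination: str   # s
    normal_link: Link        # e_s[x] = (p_s(x), x)
    rerouting_link: Link     # e_v[x] = (p_v(x), x)
    swap_destination: str    # v, the closest node on the optimal swap edge
    swap_link: Link          # the optimal swap edge (u, v)

def outgoing_link(entry: Entry, normal_link_up: bool) -> Link:
    """Link used by x for this destination; rerouting data is used only if e_s[x] is down."""
    return entry.normal_link if normal_link_up else entry.rerouting_link

# Hypothetical entry: p_s(x) = a, optimal swap edge (a, b), and p_b(x) = b.
entry = Entry("s", ("a", "x"), ("b", "x"), "b", ("a", "b"))
print(outgoing_link(entry, True), outgoing_link(entry, False))
```

The record makes explicit that the storage overhead with respect to a plain shortest path table is a constant number of fields per destination.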

TABLE 4.7: Entry in the Routing Table of x; e = (u, v) is the Optimal Swap Edge for e_s[x]

Final Destination | Normal Link | Rerouting Link | Swap Destination | Swap Link
s | (p_s(x), x) | (p_v(x), x) | v | (u, v)

Any message must thus contain, in addition to the final destination (node s in our example), also a field indicating the swap destination (node v in our example), the swap link (link (u, v) in our example), and a bit to explain which of the two must be considered (see Table 4.7). The routing mechanism is rather simple. Consider a message originating from r for node s.

PSR Routing Mechanism
1. Initially, r sets the final destination to s, the swap destination and the swap link to empty, and the bit to 0; it then sends the message toward the final destination using the normal link indicated in its routing table.
2. If a node x receives the message with final destination s and bit set to 0, then
   (a) if x = s, the message has reached its destination: s processes the message;
   (b) if e_s[x] = (p_s(x), x) is up, x forwards the unchanged message on that link;
   (c) if e_s[x] = (p_s(x), x) is down, then x
       i. copies to the swap destination and swap link fields of the message the swap destination and swap link entries for s in its routing table;
       ii. sets the bit to 1;
       iii. sends the message on the rerouting link indicated in its table.
3. If a node x receives the message with final destination s and bit set to 1, and swap destination set to v, then
   (a) if x = v, then
       i. it sets the bit to 0;
       ii. it sends the message on the swap link;
   (b) otherwise, it forwards the unchanged message on the link e_v[x] = (p_v(x), x).

4.3.3 On Correctness and Guarantees

Adaptive Routing

In all adaptive routing approaches, maintenance of the tables is carried out by broadcasting information about the status of the network; this can

Destination | Mode | SwapDest | SwapLink | Content
s | 1 | v | (u, v) | INFO

FIGURE 4.16: Message rerouted by x using the swap edge e = (u, v) of e_s[x].

be done periodically or just when changes do occur. In all cases, news of changes detected by a node will eventually reach any node (still connected to it). However, because of time delays, while an update is being disseminated, nodes still unaware will be routing messages on the basis of incorrect information. In other words, as long as there are changes occurring in the system (and for some time afterwards), the information in the tables is unreliable and might be incorrect. In particular, it is likely that routing will not be done through a shortest path; it is actually possible that messages might not be delivered as long as there are changes. This sad state of affairs is not due to the individual solutions but solely to the fact that time delays are unpredictable. As a result, it is impossible to make any guarantee on correctness and in particular on shortest path delivery for adaptive routing mechanisms.

This situation occurs even if the changes at any time are few and their nature limited, as in the SLF case. It would appear that we should be able to operate correctly in such a system; unfortunately this is not true: It is impossible to provide shortest path routing even in the single-link crash failure case. This is because the crash failure of a single edge can dramatically change all the shortest path information; thus, when the link fails, every node must be notified so that it can route messages along the new shortest paths; the subsequent recovery of that link will also require such a notification. Such a notification process needs to be repeated at each crash failure and recovery, and again the unpredictable time delays will make it impossible to guarantee correctness of the information available at the entities, and thus of the routing decisions they make on the basis of that information.

Question. What, if anything, can be guaranteed?
The only thing that we can say is that, if the changes stop (or there are no changes for a long period of time), then the updates to the routing information converge to the correct state, and routing will proceed according to the existing shortest paths. In other words, if the “noise” caused by changes stops, eventually the entities get the correct result.

Fault-Tolerant Tables

In the fault-tolerant tables approach, no maintenance of the routing tables is needed once they have been constructed. Therefore, there are no broadcasts or notifications of changes that, because of delays, might affect the correctness of the routing. However, fault-tolerant tables also suffer because of the unpredictability of time delays. For example, even with the single-link crash failure, point-of-failure shortest-path rerouting cannot be guaranteed to be correct: While the message for s is being rerouted from x toward the swap edge of e_s[x], the link e_s[x] might recover (i.e., come up again) and another link on the path may go down. Thus, the message will again be rerouted and might continue to do so if a “bad” sequence of recoveries and failures occurs.


In other words, not only will the message not reach s through a shortest path from the first point of failure, but it will not reach s at all as long as there is a change. It might be argued that such a sequence of events is highly unlikely, but it is possible. Thus, again,

Question. What, if anything, can be guaranteed?

As in the case of adaptive routing, the only guarantee is that if the changes stop (or there are no changes for a long period of time), then messages will be (during that time) correctly delivered through point-of-failure shortest paths.

4.4 ROUTING IN STATIC SYSTEMS: COMPACT TABLES

There are systems that are static in nature; for example, if Total Reliability holds, no changes will occur in the network topology. We will also consider static any system where the routing tables, once constructed, cannot be modified (e.g., because they are hardcoded/hardwired). Such is, for example, any system etched on a chip; should faults occur, the entire chip will be replaced. In these systems, an additional concern in the design of shortest path routing tables is their size, that is, an additional design goal is to construct tables that are as small as possible.

4.4.1 The Size of Routing Tables

The full routing table can be quite large. In fact, for each of its n − 1 destinations, it contains the specification (and the cost) of the shortest path to that destination. This means that each entry possibly contains O(n log w) bits, where w ≥ n is the range of the entities’ names, for a total table size of O(n^2 log w) bits. Assuming the best possible case, that is, w = n, the number of bits required to store all the n full routing tables is S_FULL = O(n^3 log n). For large n, this is a formidable amount of space just to store the routing tables.

Observe that for any destination, the first entry in the shortest path will always be a link to a neighbor.
Thus, it is possible to simplify the routing table by specifying for each destination y only the neighbor of x on the shortest path to it. Such a table is called short. For example, the short routing table for s in the network of Figure 4.1 is shown in Table 4.8. In its short representation, each entry of the table of an entity x will contain log w bits to represent the destination’s name and another log w bits to represent the neighbor’s name. In other words, the table contains 2(n − 1) log w bits. Assuming the best possible case, that is, w = n , the number of bits required to store all the routing tables is 2n(n − 1) log n.

TABLE 4.8: Short Representation of RT(s)

Destination | Neighbor
h | h
k | h
c | c
d | c
e | e
f | e

This amount of space can be further reduced if, instead of the neighbors’ names, we use the local port numbers leading to them. In this case, the size will be (n − 1)(log w + log p_x) bits, where p_x ≥ deg(x) is the range of the local port numbers of x. Assuming the best possible case, that is, w = n and p_x = deg(x) for all x, this implies that the number of bits required to store all the routing tables is at least

\[ S_{\mathrm{SHORT}} = \sum_{x} (n-1) \log \deg(x) = (n-1) \log \prod_{x} \deg(x), \]

which can still be rather large. Notice that the same information can be represented by listing for each port the destinations reached via shortest path through that port; for example, see Table 4.9. This alternative representation of RT(x) uses only deg(x) + (n − 1) log n bits, for a total of

\[ S_{\mathrm{ALT}} = \sum_{x} \bigl(\deg(x) + (n-1)\log n\bigr) = 2m + n(n-1)\log n. \]    (4.18)
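A quick numerical comparison helps to keep these expressions apart. The sketch below simply evaluates the formulas given above for the short table with neighbor names, the short table with port numbers, and the alternative representation (4.18), in the best case w = n and p_x = deg(x); the function names and the choice of an 8-node ring (for which log n is an integer) are ours.

```python
import math

def short_with_names(n):
    """2 n (n - 1) log n bits: destination name and neighbor name per entry."""
    return 2 * n * (n - 1) * math.log2(n)

def short_with_ports(degrees):
    """Sum over x of (n - 1)(log n + log deg(x)) bits."""
    n = len(degrees)
    return sum((n - 1) * (math.log2(n) + math.log2(d)) for d in degrees)

def alternative(degrees):
    """Expression (4.18): sum over x of deg(x) + (n - 1) log n bits."""
    n = len(degrees)
    return sum(d + (n - 1) * math.log2(n) for d in degrees)

ring = [2] * 8    # degrees of an 8-node ring, so n = 8 and m = 8
print(short_with_names(len(ring)), short_with_ports(ring), alternative(ring))
```

For this ring the three totals are 336, 224, and 184 bits, respectively; the last value agrees with 2m + n(n − 1) log n = 16 + 168.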

It appears that there is not much more that can be done to reduce the size of the table. This is, however, not the case if we, as designers of the system, had the power to choose the names of the nodes and of the links.

4.4.2 Interval Routing

The question we are going to ask is whether it is possible to drastically reduce this amount of storage if we know the network topology and we have the power of choosing the names of the nodes and the port labels.

An Example: Ring Networks

Consider, for example, a ring network, and assume for the moment that all links have the same cost.

TABLE 4.9: Alternative Short Representation of RT(s)

Port | Destinations
port_s(h) | h, k
port_s(c) | c, d
port_s(e) | e, f

FIGURE 4.17: (a) Assigning names and labels; (b) routing table of node 2 (right: 3, 4, 5, 6; left: 7, 8, 0, 1).

Suppose that we assign as names to the nodes consecutive integers, starting from 0 and continuing clockwise, and we label the ports right or left depending on whether or not they are in the clockwise direction. See Figure 4.17(a). Concentrate on node 0. This node, like all the others, has only two links. Thus, whenever 0 has to route a message for z > 0, it must just decide whether to send it to right or to left. Observe that the choice will be right for 1 ≤ z ≤ ⌊n/2⌋ and left for ⌊n/2⌋ + 1 ≤ z ≤ n − 1. In other words, the destinations associated with each port are consecutive integers (modulo n). This is true not just for node 0: If x has to route a message for z ≠ x, the choice will be right if z is in the interval x + 1, x + 2, . . . , x + ⌊n/2⌋ and left if z is in the interval x + ⌊n/2⌋ + 1, . . . , x − 1, where the operations are modulo n. See Figure 4.17(b).

In other words, in all these routing tables, the set of destinations associated to a port is an interval of consecutive integers, and, in each table, the intervals are disjoint. This is very important for our purpose of reducing the space. In fact, an interval has a very short representation: It is sufficient to store the two end values, that is, just 2 log n bits. We can actually do it with just log n bits; see Exercise 4.6.43. As a table consists of just two intervals, we have routing tables of 4 log n bits each, for a grand total of just 4n log n bits. This amount should be contrasted with the one of Expression 4.18 that, in the case of rings, becomes n^2 log n + l.o.t. (lower-order terms). In other words, we are able to go from quadratic

to just linear space requirements. Note that this is true even if the costs of the links are not all the same; see Exercise 4.6.44. The phenomenon we have just described is not isolated, as we will discuss next.

Routing With Intervals

Consider the names of the nodes in a network G. Without any loss of generality, we can always assume that the names are consecutive nonnegative integers, starting from 0, that is, the set of names is Z_n = {0, 1, . . . , n − 1}. Given two integers j, k ∈ Z_n, we denote by (j, k) the sequence

(j, k) = j, j + 1, j + 2, . . . , k                          if j < k
(j, k) = j, j + 1, j + 2, . . . , n − 1, 0, 1, . . . , k     if j ≥ k.

Such a sequence (j, k) is called a circular interval of Z_n; the empty interval ∅ is also an interval of Z_n.

Suppose that we are able to assign names to the nodes so that the shortest path routing tables for G have the following two properties. At every node x,

1. interval: for each link incident to x, the (names of the) destinations associated to that link form a circular interval of Z_n;
2. disjointness: each destination is associated to only one link incident to x.

If this is the case, then we can have for G a very compact representation of the routing tables, like in the example of the ring network. In fact, for each link the set of destinations is an interval of consecutive integers, and, like in the ring, the intervals associated to the links of a given node are all disjoint. In other words, each table consists of a set of intervals (some of them may be empty), one for each incident link. From the storage point of view, this is very good news because we can represent such intervals by just their start values (or, alternatively, by their end values). In other words, the routing table of x will consist of just one entry for each of its links. This means that the amount of storage for its table is only deg(x) log n bits. In turn, this means that the number of bits used in total to represent all the routing tables will be just

$$S_{INTERVAL} = \sum_{x} \deg(x) \log n = 2m \log n. \qquad (4.19)$$
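As a rough numerical illustration of the gap between Expressions 4.18 and 4.19, the sketch below evaluates both for rings of two sizes. It assumes base-2 logarithms and a ring topology (m = n); the function names are ours.

```python
from math import log2


def s_interval(n, m):
    """S_INTERVAL = 2m * log2(n), as in Expression 4.19."""
    return 2 * m * log2(n)


def s_alt(n, m):
    """S_ALT = 2m + n(n - 1) * log2(n), as in Expression 4.18."""
    return 2 * m + n * (n - 1) * log2(n)


for n in (16, 1024):
    m = n  # a ring has as many links as nodes
    print(n, s_interval(n, m), s_alt(n, m))
# 16 128.0 992.0
# 1024 20480.0 10477568.0
```

The interval tables grow as n log n while the alternative short tables grow as n^2 log n, which is the quadratic-versus-linear contrast noted for the ring.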

How will the routing mechanism then work with such tables? Suppose x has a message whose destination is y. Then x checks in its table which interval y is part of (as the intervals are disjoint, y will belong to exactly one) and sends the message to the corresponding link. Because of its nature, this approach is called interval routing. If it can be done, as we have just seen, it allows for efﬁcient shortest-path routing with a minimal amount of storage requirements.
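The lookup just described can be sketched in a few lines. The sketch assumes that each nonempty interval is stored by its start value only, and that the intervals at a node together cover every possible destination, so that each interval ends just before the next start (modulo n). The names `make_table` and `route` are hypothetical.

```python
from bisect import bisect_right


def make_table(interval_starts):
    """interval_starts maps each port with a nonempty interval to the first
    name of its circular interval; returns the entries sorted by start."""
    return sorted((start, port) for port, start in interval_starts.items())


def route(table, y):
    """Return the port whose circular interval contains destination y."""
    starts = [start for start, _ in table]
    # Index of the last start <= y; -1 selects the last entry, whose
    # interval wraps around past n - 1 to 0.
    i = bisect_right(starts, y) - 1
    return table[i][1]


# Node 2 of the 9-node ring of Figure 4.17(b):
# right covers 3, 4, 5, 6 and left covers 7, 8, 0, 1.
table = make_table({"right": 3, "left": 7})
print([route(table, y) for y in (3, 6, 7, 0, 1)])
# ['right', 'right', 'left', 'left', 'left']
```

Keeping the entries sorted by start value means a lookup is a binary search over deg(x) entries, and the wrap-around interval needs no special case beyond Python's negative indexing.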

FIGURE 4.18: Naming for interval routing in trees

It, however, requires that we, as designers, find an appropriate way to assign names to nodes so that the interval and disjointness properties hold. Given a network G, it is not so obvious how to do it or whether it can be done at all.

Tree Networks

First of all we will consider tree networks. As we will see, in a tree it is always possible to achieve our goal, and it can actually be done in several different ways. Given a tree T, we first of all choose a node s as the source, transforming T into the tree T(s) rooted in s; in this tree, each node x has a parent and some children (possibly none). We then assign as names to the nodes consecutive integers, starting from 0, according to the post-order traversal of T(s), for example, using procedure

Post_Order_Naming(x, k)
begin
    Unnamed_Children(x) := Children(x);
    while Unnamed_Children(x) ≠ ∅ do
        y ← Unnamed_Children(x);
        Post_Order_Naming(y, k)
    endwhile
    myname := k;
    k := k + 1;
end

started by calling Post_Order_Naming(s, 0). Here y ← Unnamed_Children(x) extracts (and removes) an element from the set, and the counter k is shared by all the calls.

This assignment of names has several properties. For example, any node has a larger name than all its descendants. More importantly, it has the interval and disjointness properties (Exercise 4.6.48). Informally, the interval property follows because, when executing Post_Order_Naming with input (x, k), x and its descendants will be given as names consecutive integers starting from k. See for example Figure 4.19.
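The naming procedure and the intervals it induces can be sketched as follows. This is an illustrative, centralized version, not the book's distributed procedure: it assumes the rooted tree is given as a dictionary `children` (a hypothetical representation), stores each interval as a (start, end) pair rather than by its start value alone, and uses recursion, so very deep trees would need an iterative rewrite.

```python
def post_order_naming(children, root):
    """Name the nodes 0, 1, 2, ... in post-order.
    `children` maps a node to the list of its children (leaves may be omitted)."""
    names = {}

    def visit(x):
        for y in children.get(x, []):
            visit(y)
        names[x] = len(names)  # next unused integer, i.e. the counter k

    visit(root)
    return names


def interval_tables(children, root):
    """For every node, map each incident link to the circular interval
    (start, end) of the destination names routed through it."""
    names = post_order_naming(children, root)
    n = len(names)
    low = {}  # smallest name in the subtree rooted at each node

    def fill(x):
        kids = children.get(x, [])
        for y in kids:
            fill(y)
        low[x] = low[kids[0]] if kids else names[x]

    fill(root)
    tables = {}
    for x in names:
        table = {names[y]: (low[y], names[y]) for y in children.get(x, [])}
        if x != root:
            # Everything outside the subtree of x is reached through the parent.
            table["parent"] = ((names[x] + 1) % n, (low[x] - 1) % n)
        tables[names[x]] = table
    return names, tables


children = {"s": ["a", "b"], "a": ["c", "d"]}
names, tables = interval_tables(children, "s")
print(names)      # {'c': 0, 'd': 1, 'a': 2, 'b': 3, 's': 4}
print(tables[2])  # {0: (0, 0), 1: (1, 1), 'parent': (3, 4)}
print(tables[3])  # {'parent': (4, 2)}
```

In the example, the node named 3 reaches every other node through its parent, and the corresponding interval (4, 2) wraps around: it stands for 4, 0, 1, 2.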

FIGURE 4.19: Disjoint intervals (at node 8, the intervals shown are < 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3 > and < 4, 5, 6 >).

Special Networks

Most regular network topologies we have considered in the past can be assigned names so that interval routing is possible. This is, for example, the case of the p × q mesh and torus, hypercube, butterfly, and cube-connected cycles; see Exercises 4.6.51 and 4.6.52. For these networks the construction is rather simple. Using a more complex construction, names can be assigned so that interval routing can be done also in any outerplanar graph (Exercise 4.6.53); recall that a graph is outerplanar if it can be drawn in the plane with all the nodes lying on a ring and all edges lying in the interior of the ring without crossings.

Question. Can interval routing be done in every network?

The answer is unfortunately No. In fact, there exist rather simple networks, the so-called globe graphs (one is shown in Figure 4.20), for which interval routing is impossible (Exercise 4.6.55).

Multi-Intervals

As we have seen, interval routing is a powerful technique, but the classes of networks in which it is possible are rather limited. One approach to overcoming this limitation, without excessively increasing the size of the routing table, is to associate to each link a small number of intervals. An interval-routing scheme that uses up to k intervals per edge is called a k-intervals routing scheme.

FIGURE 4.20: A globe graph: interval routing is not possible.
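To make the notion of k intervals per edge concrete, the following sketch splits the set of destinations associated to one link into its maximal circular intervals; the number of intervals returned is how many that link needs under the given naming. The function name and the (start, end) representation are our own choices, and names are assumed to be drawn from Z_n.

```python
def circular_intervals(dests, n):
    """Split a set of names from Z_n into its maximal circular intervals,
    returned as (start, end) pairs; the empty set yields no interval."""
    dests = set(dests)
    if len(dests) == n:
        return [(0, n - 1)]
    # An interval starts at every name whose predecessor (mod n) is absent.
    starts = sorted(d for d in dests if (d - 1) % n not in dests)
    intervals = []
    for s in starts:
        e = s
        while (e + 1) % n in dests:
            e = (e + 1) % n
        intervals.append((s, e))
    return intervals


print(circular_intervals({7, 8, 0, 1}, 9))     # [(7, 1)]
print(circular_intervals({1, 2, 5, 6, 8}, 9))  # [(1, 2), (5, 6), (8, 8)]
```

Under this reading, a naming yields a k-intervals routing scheme when no link at any node needs more than k intervals.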


Clearly, with enough intervals we can find a scheme for every connected graph. The question is whether this can be achieved with a small k. The answer again is No. In fact, there are graphs where O(n) intervals are needed in each edge (Exercise 4.6.56).

Suboptimal Interval Routing

A reason why it is impossible to do interval routing in all graphs is that we require the tables to provide shortest paths. The situation changes if we relax this requirement. If we ask the tables to provide us just with a path to the destination, not necessarily the shortest one, then we can use the approach already discussed in Section 4.2.6: We construct a single spanning tree T of the network G and use only the edges of T for routing. Once we have the tree T, we assign the names to the nodes using the naming algorithm for trees that provides interval routing. In this way, we obtain for G the very compact routing tables provided by interval routing. Clearly, the interval routing mechanism so constructed is optimal (i.e., shortest path) for the tree T but not necessarily so for the original network G. This means that suboptimal interval routing is always possible in any network.

Question. How much worse can a path provided by this approach be than the shortest one to the destination?

If we choose as tree T a breadth-first spanning tree rooted in a center of the graph G, then its diameter is at most twice the diameter of the original graph (the worst case is when G is a ring). This means that the longest route is never more than 2 diam(G). We can extend this approach by allowing the longest route to be within a factor β ≤ 2 of the diameter of G and by using more than one interval. We have seen that it is possible to obtain β = 2 using a single interval per edge. The question then becomes whether using more intervals we can obtain a better scheme (i.e., a smaller β). The answer is again not very positive; for example, to have the longest route shorter than (3/2) diam(G), we need O(log n) labels (Exercise 4.6.58).

4.5 BIBLIOGRAPHICAL NOTES

The construction of routing tables is a prerequisite for the functioning of many networks. One of the earliest protocols is due to William Tajibnapis [31]. The basic MapGossip for the construction of all routing tables is due to Eric Rosen [29]. Protocol IteratedConstruction is the distributed version of Bellman's sequential algorithm designed by Lester Ford and D. Fulkerson [13]; from the start it has been the main routing algorithm in the Internet. The same cost as IteratedConstruction, O(n^2 m), was incurred by several other protocols designed much later, including the ones of Philip Merlin and Adrian Segall [25] and of Jayadev Misra and Mani Chandy [22]. The improvement to O(n^3) is due to Baruch Awerbuch, who designed a protocol to construct a single shortest path tree


using O(n^2) messages [6]. The same bound is achieved by protocol PT Construction, the efficient distributed implementation of Dijkstra's sequential algorithm designed by K. Ramarao and S. Venkatesan [28]. The even more efficient Protocol SparserGossip is due to Yehuda Afek and Moty Ricklin [1]. A protocol for systems allowing long messages was designed by Sam Toueg with cost O(nm) [32]; the reduction to O(n^2) is easy to achieve using protocol MapGossip by Eric Rosen [29] (Exercise 4.6.4), constructing, however, complete maps at each entity; the same cost but with less local storage (Exercise 4.6.18) has been obtained by S. Haldar [20].

The distributed construction of min-hop spanning trees has been extensively investigated. Protocol BF (known as the "Coordinated Minimum Hop Algorithm") is due to Bob Gallager [17]; a different protocol with the same cost was independently designed by To-Yat Cheung [8]. The idea of reducing time by partitioning the layers of the breadth-first tree into groups (Section 4.2.5), together with a series of time-messages tradeoffs, is also due to Gallager [17]. Protocol BF Layers has been designed by Greg Frederickson [15]. The problem of reducing time while maintaining a reasonable message complexity has been investigated by Baruch Awerbuch [3], Baruch Awerbuch and Bob Gallager [5], and Y. Zhu and To-Yat Cheung [35]. The near-optimal bounds (Exercise 4.6.26) have been obtained by Baruch Awerbuch [4].

The suboptimal solutions of center-based and median-based routing were first discussed in detail by David Wall and Susanna Owicki [34]. The lower bound on average edge-stretch and the construction of spanning trees with low average edge-stretch (Exercises 4.6.34, 4.6.35, and 4.6.36) are due to Noga Alon, Richard Karp, David Peleg, and Doug West [2].

The idea of point-of-failure rerouting was suggested independently by Enrico Nardelli, Guido Proietti, and Peter Widmayer [27] and by Hiro Ito, Kazuo Iwama, Yasuo Okabe, and Takuya Yoshihiro [21]. The distributed algorithm for computing the swap edges (Exercise 4.6.41) was designed by Paola Flocchini, Linda Pagli, Tony Mesa, Giuseppe Prencipe, and Nicola Santoro [12].

The idea of compact routing was introduced by Nicola Santoro and Ramez Khatib [30], who designed the interval routing for trees; this idea was then extended by Jan van Leeuwen and Richard Tan [24]. The interval routing for outerplanar graphs (Exercise 4.6.53) is due to Greg Frederickson and Ravi Janardan [16]. The more restrictive notion of linear interval routing (Exercise 4.6.54 and Problem 4.6.1) was introduced and studied by Erwin Bakker, Jan van Leeuwen, and Richard Tan [7]; the more general notion of Boolean routing was introduced by Michele Flammini, Giorgio Gambosi, and Sandro Salomone [11].

Several issues of compact routing have been investigated, among others, by Greg Frederickson and Ravi Janardan [16], Pierre Fraigniaud and Cyril Gavoille [14], and Cyril Gavoille and David Peleg [19]. Exercises 4.6.56, 4.6.57, and 4.6.58 are due to Cyril Gavoille and Eric Guevremont [18], Evangelos Kranakis and Danny Krizanc [23], and Savio Tse and Francis Lau [33], respectively. Characterizations of networks supporting interval routing are due to Lata Narayanan and Sunil Shende [26], Tamar Eilam, Shlomo Moran, and Shmuel Zaks [9], and Michele Flammini, Giorgio Gambosi, Umberto Nanni, and Richard Tan [10].


4.6 EXERCISES, PROBLEMS, AND ANSWERS

4.6.1 Exercises

Exercise 4.6.1 Write the set of rules corresponding to Protocol Map Gossip described in Section 4.2.1.

Exercise 4.6.2 () Consider a tree network where each entity has a single item of information. Determine the time costs of gossiping. What would the time costs be if each entity x initially has deg(x) items?

Exercise 4.6.3 Consider a tree network where each entity has f(n) items of information. Assume that messages can contain g(n) items of information (instead of O(1)); with how many messages can gossiping be performed?

Exercise 4.6.4 Using your answer to question 4.6.3, with how many messages can all routing tables be constructed if g(n) = O(n)?

Exercise 4.6.5 Consider a tree network where each entity has f(n) items of information. Assume that messages can contain g(n) items of information (instead of O(1)); with how many messages can all items of information be collected at a single entity?

Exercise 4.6.6 Using your answer to question 4.6.5, with how many messages can all routing tables be constructed at that single entity if g(n) = O(n)?

Exercise 4.6.7 Write the set of rules corresponding to Protocol Iterated Construction described in Section 4.2.2. Implement and properly test your implementation.

Exercise 4.6.8 Prove that Protocol Iterated Construction converges to the correct routing tables and will do so after at most n − 1 iterations. Hint: Use induction to prove that V_x^i[z] is the cost of the shortest path from x to z using at most i hops.

Exercise 4.6.9 We have assumed that the cost of a link is the same in both directions, that is, θ(x, y) = θ(y, x). However, there are cases when θ(x, y) can be different from θ(y, x). What modifications have to be made so that protocol Iterated Construction works correctly also in those cases?

Exercise 4.6.10 In protocol PT Construction, no action is provided for an idle entity receiving an Expand message. Prove that such a message will never be received in such a state.

Exercise 4.6.11 In procedure Compute Local Minimum of protocol PT Construction, an entity might set path length to infinity. Show that if this happens, this entity will set path length to infinity in all subsequent iterations.


Exercise 4.6.12 In protocol PT Construction, each entity will eventually set path length to infinity. Show that when this happens to a leaf of the constructed tree, that entity can be removed from further computations.

Exercise 4.6.13 Modify protocol PT Construction so that it constructs the routing table RT(s) of the source s.

Exercise 4.6.14 We have assumed that the cost of a link is the same in both directions, that is, θ(x, y) = θ(y, x). However, there are cases when θ(x, y) can be different from θ(y, x). What modifications have to be made so that protocol PT Construction works correctly also in those cases?

Exercise 4.6.15 Prove that any G has a (log n, n) sparser.

Exercise 4.6.16 Show how to construct a (log n, n) sparser with O(m + n log n) messages.

Exercise 4.6.17 Show how to use a (log n, n) sparser to solve the all-pairs shortest paths problem in O(n^2 log n) messages.

Exercise 4.6.18 Assume that messages can contain O(n) items of information (instead of O(1)). Show how to construct all the shortest path trees with just O(n^2) messages.

Exercise 4.6.19 Prove that, after iteration i − 1 of protocol BF Construction, (a) all the nodes at distance up to i − 1 are part of the tree; (c) each node at distance i − 1 knows which of its neighbors are at distance i − 1.

Exercise 4.6.20 Write the set of rules corresponding to protocol BF described in Section 4.2.2. Implement and properly test your implementation.

Exercise 4.6.21 Write the set of rules corresponding to protocol BF Levels. Implement and properly test your implementation.

Exercise 4.6.22 Let Explore(j, k) be the first message x accepts in the expansion phase of protocol BF Levels. Prove that the number of times x will change its level in this phase is at most j − t + 1 < l.

Exercise 4.6.23 Prove that in the expansion phase of an iteration of protocol BF Levels, all nodes in levels t + 1 to t + l are reached and attached to the existing fragment, where t is the level of the sources (i.e., the leaves in the current fragment).

Exercise 4.6.24 Consider protocol BF Levels when l = d(G). Show how to obtain the same message and time complexity without any a priori knowledge of d(G).


Exercise 4.6.25 Prove that if we choose l = d(G) in protocol BF Levels, then in any synchronous execution the number of messages will be exactly 2m + n − 1.

Exercise 4.6.26 () Show how to construct a breadth-first spanning tree in time O(d(G)^{1+ε}) using no more than O(m^{1+ε}) messages, for any ε > 0.

Exercise 4.6.27 Let c be a center of G and let SPT(c) be the shortest path tree of c. Prove that diam(G) ≤ 2 diam(SPT(c)).

Exercise 4.6.28 Let T be a spanning tree of G. Prove that

$$\sum_{(x,y) \in T} |T[x-y]| \, |T[y-x]| \, w(x,y) = \sum_{u,v \in T} d_T(u,v).$$

Exercise 4.6.29 (median-based routing) Let z be a median of G (i.e., a node for which the sum of distances to all other nodes is minimized) and let PT(z) be the shortest path tree of z. Prove that Traffic(PT(z)) ≤ 2 Traffic(T), where T is the spanning tree of G for which Traffic is minimized.

Exercise 4.6.30 Consider a ring network R_n with weighted edges. Prove or disprove that PT(c) = MSP(R_n), where c is a center of R_n and MSP(R_n) is the minimum-cost spanning tree of R_n.

Exercise 4.6.31 Consider a ring network R_n with weighted edges. Let c and z be a center and a median of R_n, respectively.
1. For each of the following spanning trees of R_n, compare the stretch factor and the edge-stretch factor: PT(c), PT(z), and the minimum-cost spanning tree MSP(R_n).
2. Determine bounds on the average edge-stretch factor of PT(c), PT(z), and MSP(R_n).

Exercise 4.6.32 () Consider an a × a square mesh M_{a,a} where all costs are the same.
1. Is it possible to construct two spanning trees T′ and T″ such that σ(T′) < σ(T″) but (T′) > (T″)? Explain.
2. Is it possible to construct two spanning trees T′ and T″ such that σ(T′) < σ(T″) but (T′) > (T″)? Explain.

Exercise 4.6.33 Consider a square mesh M_{a,a} where all costs are the same. Construct two spanning trees T′ and T″ such that σ(T′) < σ(T″) but (T′) > (T″).

Exercise 4.6.34 () Show that there are graphs G with unweighted edges where the average edge-stretch is Ω(log n) for every spanning tree T of G.

Exercise 4.6.35 () Design an efficient protocol for computing a spanning tree with low average edge-stretch of a network G with unweighted edges.


Exercise 4.6.36 () Design an efficient protocol for computing a spanning tree with low average edge-stretch of a network G with weighted edges.

Exercise 4.6.37 () Design a protocol for computing the secondary paths of a node x. You may assume that the shortest-path tree PT(x) has already been constructed and that each node knows its own and its neighbors' distance from x. Your protocol should use no more messages than that required to construct PT(x).

Exercise 4.6.38 (split horizon) () Consider the following technique, called split horizon, for solving the count-to-infinity problem discussed in Section 4.3.1: During an iteration, a node a does not send its cost for destination c to its neighbor b if b is the next node in the "best" path (so far) from a to c. In the example of Figure 4.13, in the first iteration y does not send its cost for w to z, and thus z will correctly set its cost for w to K. In the next two iterations y and x will correctly set their cost for w to K + 1 and K + 2, respectively. Prove or disprove that split horizon solves the count-to-infinity problem.

Exercise 4.6.39 (split horizon with poison reverse) () Consider the following technique, called split horizon with poison reverse, for solving the count-to-infinity problem discussed in Section 4.3.1: During an iteration, a node a sends its cost for destination c set to ∞ to its neighbor b if b is on the "best" path (so far) from a to c. Prove or disprove that split horizon with poison reverse solves the count-to-infinity problem.

Exercise 4.6.40 () Design an efficient protocol that, given a shortest-path spanning tree PT(s), determines an optimal swap for every edge in PT(s): At the end of the execution, every node x knows the optimal swap edge for its incident link e_s[x]. Your protocol should use no more than O(n h(s)) messages, where h(s) is the height of PT(x).

Exercise 4.6.41 () Show how to answer Exercise 4.6.40 using no more than O(n (s)) messages, where n (s) is the number of edges in the transitive closure of PT(x).

Exercise 4.6.42 Let e = (u, v) be the optimal swap edge that x has computed for e_s[x]. Prove that, if e_s[x] fails, to achieve point-of-failure shortest path rerouting, x must send the message for s to the incident link (p_v(x), x).

Exercise 4.6.43 Show how to represent the intervals of a ring with just log n bits per interval.

Exercise 4.6.44 Show that the intervals of a ring can be represented with just log n bits per interval, even if the costs of the links are not all the same.


Exercise 4.6.45 Let G be a network and assume that we can assign names to the nodes so that in each routing table, the destinations for each link form an interval. Determine what conditions the intervals must satisfy so that they can be represented with just log n bits each.

Exercise 4.6.46 Redefine properties interval and disjointness in case the n integers used as names are not consecutive, that is, they are chosen from a larger set Z_w, w > n.

Exercise 4.6.47 Show an assignment of names in a tree that does not have the interval property. Does there exist an assignment of distinct names in a tree that has the interval property but not the disjointness one? Explain your answer.

Exercise 4.6.48 Prove that in a tree, the assignment of names by Post-Order traversal has both interval and disjointness properties.

Exercise 4.6.49 Prove that in a tree, also the assignment of names by Pre-Order traversal has both interval and disjointness properties.

Exercise 4.6.50 Determine whether interval routing is possible in the regular graph shown in Figure 4.21. If so, show the routing table; otherwise explain why.

Exercise 4.6.51 Design an optimal interval routing scheme for p × q mesh and torus. How many bits of storage will it require?

Exercise 4.6.52 Design an optimal interval routing scheme for d-dimensional (a) hypercube, (b) butterfly, and (c) cube-connected cycles. How many bits of total storage will each require?

FIGURE 4.21: The regular graph used in Exercise 4.6.50.


Exercise 4.6.53 () Show how to assign names to the nodes of an outerplanar graph so that interval routing is possible.

Exercise 4.6.54 () If for every x all the intervals in its routing table are strictly increasing (i.e., there is no "wraparound" node 0), the interval routing is called linear. Prove that there are networks for which there exists interval routing but linear interval routing is impossible.

Exercise 4.6.55 Prove that in the globe graph of Figure 4.20, interval routing is not possible.

Exercise 4.6.56 () Consider the approach of k-interval routing. Prove that there are graphs that require k = O(n) intervals.

Exercise 4.6.57 () Consider allowing each route to be within a factor α from optimal. Prove that if we want α = 2, there are graphs that require O(n^2) bits of storage at each node.

Exercise 4.6.58 () Consider allowing the longest route to be within a factor β from the diameter diam(G) of the network, using at most k labels per edge. Prove that if we want β < 3/2, then there are graphs that require O(log n) bits of storage at each node.

4.6.2 Problems

Problem 4.6.1 Linear Interval Routing. () If for every x all the intervals in its routing table are strictly increasing (i.e., there is no "wraparound" node 0), the interval routing is called linear. Characterize the class of graphs for which there exists a linear interval routing.

4.6.3 Answers to Exercises

Partial Answer to Exercise 4.6.26. Choose the size of the strip to be k = √d(G). A strip cover is a collection of trees that span all the source nodes of a strip. In iteration i, first of all construct a "good" cover of strip i.

Answer to Exercise 4.6.29. Observe that for any spanning tree T of G, Traffic(T) = Σ_{u,v∈V} d_T(u, v) (Exercise 4.6.28). Let SumDist(x) = Σ_{u∈V} d_G(u, x); clearly Traffic(T) ≥ Σ_{x∈V} SumDist(x). Let z be a median of G (i.e., a node for which SumDist is minimized); then SumDist(z) ≤ (1/n) Traffic(T). Thus we have that

$$\mathrm{Traffic}(PT(z)) = \sum_{u,v \in V} d_{PT(z)}(u,v) \le \sum_{u,v \in V} \bigl(d_{PT(z)}(u,z) + d_{PT(z)}(z,v)\bigr) \le (n-1) \sum_{u \in V} d_{PT(z)}(u,z) + (n-1) \sum_{v \in V} d_{PT(z)}(v,z) = 2(n-1)\,\mathrm{SumDist}(z) \le 2\,\mathrm{Traffic}(T).$$


FIGURE 4.22: Graph with interval routing but where no linear interval routing exists.

Answer to Exercise 4.6.43. In the table of node x, the interval associated to right always starts with x + 1 while the one associated to left always ends with x − 1. Hence, for each interval, it is sufficient to store only the other end value.

Partial Answer to Exercise 4.6.54. Consider the graph shown in Figure 4.22.

BIBLIOGRAPHY

[1] Y. Afek and M. Ricklin. Sparser: a paradigm for running distributed algorithms. Journal of Algorithms, 14(2):316–28, March 1993.
[2] N. Alon, R.M. Karp, D. Peleg, and D. West. A graph-theoretic game and its application to the k-server problem. SIAM Journal on Computing, 24:78–100, 1995.
[3] B. Awerbuch. Reducing complexities of the distributed max-flow and breadth-first-search algorithms by means of network synchronization. Networks, 15:425–437, 1985.
[4] B. Awerbuch. Distributed shortest path algorithms. In Proc. 21st Ann. ACM Symp. on Theory of Computing, pages 490–500, 1989.
[5] B. Awerbuch and R.G. Gallager. A new distributed algorithm to find breadth first search trees. IEEE Transactions on Information Theory, 33:315–322, 1987.
[6] B. Awerbuch. Complexity of network synchronization. Journal of the ACM, 32(4):804–823, October 1985.
[7] E.M. Bakker, J. van Leeuwen, and R.B. Tan. Linear interval routing. Algorithms Review, 2(2):45–61, 1991.
[8] T.-Y. Cheung. Graph traversal techniques and the maximum flow problem in distributed computation. IEEE Transactions on Software Engineering, 9:504–512, 1983.
[9] T. Eilam, S. Moran, and S. Zaks. The complexity of the characterization of networks supporting shortest-path interval routing. In 4th International Colloquium on Structural Information and Communication Complexity, pages 99–11, Ascona, 1997.
[10] M. Flammini, G. Gambosi, U. Nanni, and R.B. Tan. Characterization results of all shortest paths interval routing schemes. Networks, 37(4):225–232, 2001.
[11] M. Flammini, G. Gambosi, and S. Salomone. Boolean routing. In 7th International Workshop on Distributed Algorithms, pages 219–233, Lausanne, 1993.
[12] P. Flocchini, L. Pagli, T. Mesa, G. Prencipe, and N. Santoro. Point-of-failures shortest path rerouting: computing the optimal swaps distributively. IEICE Transactions, 2006.
[13] L.R. Ford and D.R. Fulkerson. Flows in Networks. Princeton University Press, 1962.


[14] P. Fraigniaud and C. Gavoille. Interval routing schemes. Algorithmica, 21(2):155–182, 1998.
[15] G.N. Frederickson. A distributed shortest path algorithm for a planar network. Information and Computation, 86(2):140–159, June 1990.
[16] G.N. Frederickson and R. Janardan. Designing networks with compact routing tables. Algorithmica, 3:171–190, June 1988.
[17] R.G. Gallager. Distributed minimum hop algorithms. Technical Report LIDS-P-1175, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, 1982.
[18] C. Gavoille and E. Guevremont. Worst case bounds for shortest path interval routing. Journal of Algorithms, 27:1–25, 1998.
[19] C. Gavoille and D. Peleg. The compactness of interval routing. SIAM Journal on Discrete Mathematics, 12(4):459–473, 1999.
[20] S. Haldar. An 'all pairs shortest paths' distributed algorithm using 2n^2 messages. In Proceedings of the 19th International Workshop on Graph-Theoretic Concepts in Computer Science (WG'93), Utrecht, Netherlands, June 1993.
[21] H. Ito, K. Iwama, Y. Okabe, and T. Yoshihiro. Single backup table schemes for shortest-path routing. Theoretical Computer Science, 333:347–353, 2004.
[22] J. Misra and K.M. Chandy. Distributed computations on graphs: shortest path algorithms. Communications of the ACM, 25(11):833–837, November 1982.
[23] E. Kranakis and D. Krizanc. Lower bounds for compact routing. In 13th Symposium on Theoretical Aspects of Computer Science, pages 529–540, Grenoble, February 1996.
[24] J. van Leeuwen and R.B. Tan. Interval routing. The Computer Journal, 30:298–307, 1987.
[25] P.M. Merlin and A. Segall. A failsafe distributed routing protocol. IEEE Transactions on Communications, 27(9):1280–1287, September 1979.
[26] L. Narayanan and S. Shende. Characterization of networks supporting shortest-path interval labelling schemes. In 3rd International Colloquium on Structural Information and Communication Complexity, pages 73–87, 1996.
[27] E. Nardelli, G. Proietti, and P. Widmayer. Swapping a failing edge of a single source shortest paths tree is good and fast. Algorithmica, 35:56–74, 2003.
[28] K.V.S. Ramarao and S. Venkatesan. On finding and updating shortest paths distributively. Journal of Algorithms, 13(2):235–257, 1992.
[29] E.C. Rosen. The updating protocol of Arpanet's new routing algorithm. Computer Networks, 4:11–19, 1980.
[30] N. Santoro and R. Khatib. Labeling and implicit routing in networks. The Computer Journal, 28:5–8, 1985.
[31] W.D. Tajibnapis. A correctness proof of a topology information maintenance protocol for a distributed computer network. Communications of the ACM, 20(7):477–485, 1977.
[32] S. Toueg. An all-pairs shortest-path distributed algorithm, 1980.
[33] S.S.H. Tse and F.C.M. Lau. On the space requirement of interval routing. IEEE Transactions on Computers, 48(7):752–757, July 1999.
[34] D.W. Wall and S. Owicki. Construction of centered shortest-path trees in networks. Networks, 13(2):207–332, 1983.
[35] Y. Zhu and T.-Y. Cheung. A new distributed breadth-first-search algorithm. Information Processing Letters, 25:329–333, 1987.

CHAPTER 5

Distributed Set Operations

5.1 INTRODUCTION

In a distributed computing environment, each entity has its own data stored in its local memory. Some data items held by one entity are sometimes related to items held by other entities, and we focus and operate on them. An example is the set of the ids of the entities; in earlier chapters we operated on this set, for example, by finding the smallest id or the largest one. Another example is the set of the single values held by each entity, where the operation was to find the overall rank of each of those values. In all these examples, the relevant data held by an entity consist of just a single data item.

In general, an entity x has a set of relevant data $D_x$. The union of all these local sets forms a distributed set of data

$$D = \bigcup_{x} D_x \qquad (5.1)$$

and the tuple $\langle D_{x_1}, D_{x_2}, \ldots, D_{x_n} \rangle$ describes the distribution of D among the entities $x_1, x_2, \ldots, x_n$. Clearly there are many different distributions of the same distributed set.

There are two main types of operations that can be performed on a distributed set:

1. queries and
2. updates.

A query is a request for some information about the global data set D, as well as about the individual sets $D_x$ forming D. A query can originate at any entity. If the entity where the query originates has the desired information locally, the query can be answered immediately; otherwise, the entity will have to communicate with other entities to obtain the desired information. As usual, when answering a query we are concerned with the communication costs rather than the local processing costs.


An update is a request to change the composition of the distributed set. There are two basic updates: the request to add a new element to the set, an operation called insertion; and the request to remove an element from the set, an operation called deletion. A third update is the request to change the value of an existing item of the set, an operation called change. Note that a change can be seen as a deletion of the item with the old value followed by an insertion of an item with the new value.

There are many distributions of the same set. In a distribution, the local sets are not necessarily distinct or disjoint. Two extreme cases serve to illustrate the spectrum of distributions and the impact that the structure of the distribution has when handling queries and performing updates. One extreme distribution is the partition, where the local sets have no elements in common:

$$D_i \cap D_j = \emptyset, \qquad i \neq j.$$

At the other end of the spectrum is the multiple-copy distribution, where every entity has a copy of the entire data set:

$$\forall i \quad D_i = D.$$

A multiple-copy distribution is excellent for queries but poor for updates. Queries are easy because all entities possess all the data; hence every answer can be derived locally, without any communication. However, an update requires modification of the data held at each and every entity; in the presence of concurrent updates, this process becomes exceedingly difficult.

The situation is reversed in the partition. As each data item is located at only one site, answering a query requires searching through all potential entities to find the one that has locally stored the required data. By contrast, performing an update is easy because the change is performed only at the entity having the item, and there is no danger of concurrent updates on the same item.

In most cases, the data are partially replicated; that is, some data items are stored at more than one entity while others are to be found at only one entity. This means that, in general, we have to face and deal with the problems of both extremes, partition and multiple-copy distributions, without the advantages of either one.

In the following we will first focus on an important class of queries, called order statistics; the problem of answering such queries is traditionally called selection. As selection, like most queries, is more easily and efficiently solved if the distribution is sorted, we will also investigate the problem of sorting the distributed data. We will then concentrate on distributed set operations; that is, computing union, intersection, and differences of the local sets. The ability to perform such operations has a direct impact on the processing of complex queries usually performed in databases.

To focus on the problems, we will assume the standard set of restrictions IR (Connectivity, Total Reliability, Bidirectional Links, Distinct Identifiers). For simplicity, as local processing time does not interest us when we consider the cost of our protocols, we will assume that all of the data stored at an entity are sorted.


IMPORTANT. As we consider arbitrary distributions of the data set, it is possible that a data item a is in more than one local set. As we assume Distinct Identifiers (ID), we can use the ids of the entities to break ties and create a total order even among copies of the same value; so, for example, if a is in both $D_x$ and $D_y$ where id(x) > id(y), then we can say that the copy of a in $D_x$ is "greater" than the one in $D_y$. In this way, if so desired, the copies can also be considered distinct and included in the global data set D by the union operation (5.1).

5.2 DISTRIBUTED SELECTION

5.2.1 Order Statistics

Given a totally ordered data set D of size N distributed among the entities, the distributed selection problem is the general problem of locating D[K], the Kth smallest element of D. Problems of this type are called order statistics, to distinguish them from the more conventional cardinal statistics (e.g., average, standard deviation, etc.). Order statistics are more difficult than cardinal statistics to compute in a distributed environment.

We have already seen and examined the problem of computing D[1] (i.e., the minimum value) and D[N] (i.e., the maximum value). Other elements whose ranks are of particular importance are the medians of the data set. If N is odd, there is only one median, $D[\lceil N/2 \rceil]$. If N is even, there are two medians: the lower median D[N/2] and the upper median D[N/2 + 1]. Unlike the case of D[1] and D[N], the problem of finding the median(s), and of K-selection for an arbitrary value of K, is not simple and is considerably more expensive to resolve. The complexity of the problem depends on many parameters, including the number n of entities, the size N = |D| of the set, the number $n_x = |D_x|$ of elements stored at an entity x, the rank K of the element being sought, and the topology of the network.

Before proceeding to examine strategies for its solution, let us introduce a fundamental property and a basic observation that will be helpful in our designs. Let $\overline{D}[K]$ denote the Kth largest element of the data set. Then

Property 5.2.1  $D[K] = \overline{D}[N - K + 1]$

Thus looking for the Kth smallest is the same as looking for the (N − K + 1)th largest. Consider, for example, a set of 10 distinct elements; the 4th smallest is clearly the 7th largest; see Figure 5.1, where the elements $d_1, \ldots, d_{10}$ of the set are represented sorted in increasing order. This fact has many important consequences, as we will see later.

The other useful tool is based on a trivial observation.

Property 5.2.2  $D_x[K + 1] > D[K] > \overline{D}_x[N - K + 2]$

This means that, if an entity x has more than K items, it needs only to consider the smallest K items. Similarly, if x has more than (N − K + 1) items, it needs only to consider the largest (N − K + 1) items.
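Both properties are easy to check on a small instance. The following Python sketch is only illustrative: the helper names (`kth_smallest`, `kth_largest`, `properties_hold`) and the data values are hypothetical, and the items are assumed to be distinct.

```python
def kth_smallest(items, k):
    """Return the k-th smallest element (1-based) of distinct items."""
    return sorted(items)[k - 1]


def kth_largest(items, k):
    """Return the k-th largest element (1-based) of distinct items."""
    return sorted(items, reverse=True)[k - 1]


def properties_hold(data, local):
    """Check Properties 5.2.1 and 5.2.2 for every rank K.

    data  -- the whole set D (distinct values)
    local -- the local set D_x of one entity, a subset of data
    """
    n_total = len(data)
    for k in range(1, n_total + 1):
        target = kth_smallest(data, k)                      # D[K]
        # Property 5.2.1: D[K] is the (N-K+1)-th largest element.
        if target != kth_largest(data, n_total - k + 1):
            return False
        # Property 5.2.2, left part: D_x[K+1] > D[K] whenever D_x[K+1] exists.
        if len(local) >= k + 1 and not kth_smallest(local, k + 1) > target:
            return False
        # Property 5.2.2, right part: D[K] > (N-K+2)-th largest of D_x, if it exists.
        if (len(local) >= n_total - k + 2
                and not target > kth_largest(local, n_total - k + 2)):
            return False
    return True


data = [31, 4, 15, 9, 26, 53, 58, 97, 93, 23]   # N = 10 distinct values
local = data[:7]                                # hypothetical local set D_x
print(properties_hold(data, local))
```

With N = 10 and K = 4, `kth_smallest(data, 4)` and `kth_largest(data, 7)` return the same item, which is the situation depicted in Figure 5.1.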

[Figure 5.1: the elements d1, ..., d10 of D in increasing order; the Kth element counted from the left is the (N − K + 1)th element counted from the right.]

FIGURE 5.1: The Kth smallest is precisely the (N − K + 1)th largest.

Finally, we will assume that the selection process will be coordinated by a single entity and that all communication will take place on a spanning tree of the network. Although it does not matter for the correctness of our protocols which entity is selected as coordinator and which spanning tree is chosen for communication, for efficiency reasons it is convenient to choose as coordinator a communication center s of the network and to choose as spanning tree the shortest path spanning tree PT(s) for s. Recall (Section 2.6.6) that a communication center s is a node that minimizes the sum of the distances to all other nodes (i.e., $\sum_v d_G(v, s)$ is minimum). Also recall (Section 4.2.3) that, by definition of the shortest path spanning tree, PT(s) is such that $d_G(v, s) = d_{PT(s)}(v, s)$ for all entities v. In the following we will assume that s is used as coordinator, and for simplicity we will denote PT(s) simply as T.

5.2.2 Selection in a Small Data Set

We will first consider the selection problem when the data set is rather small; more precisely, we consider data sets where N = O(n). A special instance of a small distributed set is when every $D_x$ is a singleton: it contains just a single element $d_x$; this is, for example, the case when the only data available at a node is its id.

Input Collection

As the data set is small, the simple solution of collecting all the data at the coordinator s and letting s solve the problem locally is actually feasible from a complexity point of view. The cost of collecting all the data items at s is clearly $\sum_v d_G(v, s)$. To this, we must add an initial broadcast to notify the entities to send their data to the coordinator, and (if needed) a final broadcast to notify them of the final result; as these are done on a tree, their cost will be 2(n − 1) messages. Hence the total cost of this protocol, which we can call Collect, is

$$M[\text{Collect}] = \sum_{v} d_G(v, s) + 2(n - 1) \qquad (5.2)$$
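As a rough illustration of expression (5.2), the sketch below evaluates the cost of Collect on two small networks. It assumes one data item per entity and one message per item per hop; the adjacency-list representation and the function names (`bfs_distances`, `communication_center`, `collect_cost`) are choices made for this example only.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node of a connected graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def communication_center(adj):
    """A node minimizing the sum of the distances to all other nodes."""
    return min(adj, key=lambda v: sum(bfs_distances(adj, v).values()))


def collect_cost(adj, s):
    """Evaluate (5.2): sum of distances to s plus two broadcasts on a tree."""
    n = len(adj)
    return sum(bfs_distances(adj, s).values()) + 2 * (n - 1)


n = 6
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
complete = {i: [j for j in range(n) if j != i] for i in range(n)}

ring_cost = collect_cost(ring, communication_center(ring))
complete_cost = collect_cost(complete, communication_center(complete))
```

On the complete graph every item is one hop away from s, so the distance term is n − 1; on the ring the items travel up to n/2 hops, and the distance term grows quadratically with n.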


Notice that, depending on the network,

$$n - 1 \le \sum_{v} d_G(v, s) \le \left\lceil \frac{n}{2} \right\rceil \left\lceil \frac{n-1}{2} \right\rceil$$

where the lower bound is achieved, for example, when G is a complete graph, and the upper bound is achieved, for example, when G is a ring. So $M[\text{Collect}] = O(n^2)$ in the worst case. This approach is somewhat of an overkill, as the entire set is collected at s.

Truncated Ranking

It might be possible to reduce the number of messages by making it dependent on the value of K. In fact, we can use the existing ranking protocol for trees (Exercise 2.9.4) and execute it on T until the Kth smallest item is found. The use of the ranking algorithm will then cost no more than

$$\sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s).$$

Note that, if K > N − K + 1, we can exploit Property 5.2.1 and use the ranking algorithm to assign ranks in decreasing order until the (N − K + 1)th largest element is ranked. In this case, the cost will be no more than

$$\sum_{\mathrm{Rank}(v) \ge K} 2\, d_G(v, s).$$

To this we must add the initial broadcast to set up the ranking and a final broadcast to notify the entities of the final result; as these are done on a tree, their cost will be 2(n − 1) messages. Hence, assuming K ≤ N − K + 1, the total cost of this protocol, which we can call Rank, is

$$M[\text{Rank}] \le \sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s) + 2(n - 1). \qquad (5.3)$$

Notice that, depending on the network,

$$2(K - 1) \le \sum_{\mathrm{Rank}(v) \le K} 2\, d_G(v, s) \le \frac{K}{2}\left(n - \frac{K}{2} + 1\right)$$

where the lower bound is achieved, for example, when G is a complete graph, and the upper bound could be achieved, for example, when G is a ring. This means that, in any case, $M[\text{Rank}] \le n\Delta$, where $\Delta = \min\{K, N - K + 1\}$. In other words, if K (or N − K + 1) is small, Rank will be much more efficient than Collect. As K becomes larger, the cost increases until, when K = N/2, the two protocols have the same cost.


IMPORTANT. The protocols we have seen are generic, in that they apply to any topology. For particular networks, it is possible to take advantage of the properties of the topology so as to obtain a more efficient selection protocol. This is the case of the ring (Exercise 5.6.1), the mesh (Exercise 5.6.2), and the complete binary tree (Exercise 5.6.3). The problem of designing a selection protocol that uses $o(n^2)$ messages in the worst case is still unsolved (Problem 5.6.1).

5.2.3 Simple Case: Selection Among Two Sites

In the previous section we have seen how to perform selection when the number of data items is small: N = O(n). In general, this is not the case; in fact, N is typically larger than n by orders of magnitude. So, in general, the techniques that we have seen so far are clearly not efficient. What we need is a different strategy to deal with the general case, in particular when N >> n.

In this section we will examine this problem in a simple setting where n = 2; that is, there are only two entities in the system, x and y. We will develop efficient solution strategies; some of the insights will be useful when faced with the more general case in later sections.

Median

Let us consider first the problem of determining the lower median, that is, D[N/2]. Recall that this is the unique element that has exactly N/2 − 1 elements smaller than itself and exactly N/2 elements larger than itself.

A simple solution is the following. First of all, one of the entities (e.g., the one where the selection query originates, or the one with the smallest id) is elected; it will receive the entire set of the other entity. The elected entity, say x, will then locally determine the median of the set $D_x \cup D_y$ and communicate it, if necessary, to the other entity. Notice that, as x now has the entire data set locally available, it can answer any selection query, not just the one for the lower median.
The drawback of this solution is that the amount of communication is significant, as an entire local set is transferred. We can obviously elect the entity with the larger set to minimize the number of messages; still, O(N) messages must be transferred in the worst case. A more efficient technique is clearly needed.

We can design such a technique on the basis of a simple observation: if we compare the medians of the two local sets, then we can immediately eliminate almost half of the elements from consideration. Let us see why and how.

Assume for simplicity that each local set contains $N/2 = 2^{p-1}$ elements; this means that both $D_x$ and $D_y$ have a lower median, $m_x = D_x[2^{p-2}]$ and $m_y = D_y[2^{p-2}]$, respectively. The overall lower median will have exactly $N/2 - 1 = 2^{p-1} - 1$ elements smaller than itself and exactly $N/2 = 2^{p-1}$ elements larger than itself. For example, consider the two sets of size N/2 = 16 shown in Figure 5.2(a), where each black circle indicates a data element and, in each set, the elements are shown locally sorted in a left-to-right increasing order; then $m_x = D_x[8]$ and $m_y = D_y[8]$.

Assume that $m_x > m_y$; then each element in $D_x$ larger than $m_x$ must also be larger than $m_y$. This means that each of them is larger than at least $2^{p-2}$ elements in $D_x$ and at least $2^{p-2}$ elements in $D_y$; that is, it has at least $2^{p-2} + 2^{p-2} = 2^{p-1} = N/2$ elements smaller than itself, and therefore it cannot be the lower median. In other words, any element larger than the larger of the two local medians can be discounted from consideration, as it is larger than the overall median. See Figure 5.2(b).

Similarly, all the elements in $D_y$ smaller than $m_y$ can be discounted as well. In fact, each such element is smaller than at least $2^{p-2}$ elements in its own set and at least $2^{p-2} + 1$ elements in the other set; that is, it has at least $2 \cdot 2^{p-2} + 1 = 2^{p-1} + 1 = N/2 + 1$ elements larger than itself, and therefore it cannot be the lower median. See Figure 5.2(b).

Thus, by locally calculating and then exchanging the median of each set, about half of the elements of each set, and therefore about half of the total number of elements, can be discounted; they are shown as white circles in Figure 5.2(c).

[Figure 5.2: (a) the two locally sorted sets Dx and Dy with their lower medians mx and my, where mx > my; (b) the elements of Dx greater than mx are too large and the elements of Dy smaller than my are too small; (c) the elements still under consideration.]

FIGURE 5.2: Half of the elements can be discarded after a single comparison of the two local medians.
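The elimination step just described can be sketched in a few lines of Python. This is only an illustration under the same simplifying assumptions used in the text (two sorted local sets of equal, even size, and distinct values); the function names `lower_median` and `halving_step` and the sample data are hypothetical.

```python
def lower_median(sorted_items):
    """Lower median of a non-empty sorted list (1-based rank ceil(len/2))."""
    return sorted_items[(len(sorted_items) + 1) // 2 - 1]


def halving_step(dx, dy):
    """One comparison of the local lower medians.

    Returns the elements of dx and of dy that remain under consideration.
    """
    mx, my = lower_median(dx), lower_median(dy)
    if mx < my:
        # Symmetric case: swap the roles of the two sites.
        kept_y, kept_x = halving_step(dy, dx)
        return kept_x, kept_y
    # mx > my: elements of dx above mx are too large,
    # elements of dy below my are too small.
    return [a for a in dx if a <= mx], [b for b in dy if b >= my]


dx = [3, 8, 10, 14, 21, 25, 30, 41]
dy = [1, 2, 5, 7, 12, 17, 19, 33]
kept_x, kept_y = halving_step(dx, dy)

before = lower_median(sorted(dx + dy))
after = lower_median(sorted(kept_x + kept_y))
```

In this instance 7 of the 16 elements are discarded, and the overall lower median computed before the step coincides with the lower median of the elements that remain, which is what allows the step to be repeated.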


There is a very interesting and important property (Exercise 5.6.4): the overall lower median is the lower median of the elements still under consideration. This means that we can reapply the same process to the elements still under consideration: the entities communicate to each other the lower median of the local elements under consideration, these are compared, and half of all these data are removed from consideration.

In other words, we have just designed a protocol, which we shall call Halving, that is composed of a sequence of iterations; in each, half of the elements still under consideration are discarded, and the sought global median is still the median of the considered data. This process is repeated until only a single element is left at each site and the median can be unambiguously determined. As we halve the problem size at every iteration, the total number of iterations is log N. Each iteration requires the communication of the local lower medians (of the elements still under consideration), a task that can be accomplished using just one message per iteration.

The working of the protocol has been described assuming that N is a power of two and that both sets have the same number N/2 of elements. Fortunately, these two assumptions are not essential. In fact, the protocol Halving can be adjusted to two arbitrarily sized sets without changing its complexity (Exercise 5.6.5).

Arbitrary K

We have just seen a simple and efficient protocol for finding the overall (lower) median $D[\lceil N/2 \rceil]$ of a set D distributed over two sites. Let us consider the general problem of selecting D[K], the Kth smallest element of D, when K is arbitrary, 1 ≤ K ≤ N. Assume again, for simplicity, that the two sets have the same size N/2. We know already how to deal with the case K = N/2.

Case K < N/2

Consider first the case when K < N/2. This means that each of the two sites has locally more than K elements. An example with N/2 = 12 and K = 4 is shown in Figure 5.3. Consider the set $D_x$.
As we are looking for the Kth smallest data item overall, any data item greater than $D_x[K]$ cannot be D[K] (as it will be larger than at least K data items). This means that we can immediately discount all these items, keeping only K items still under consideration. For example, in Figure 5.3(a) we have N/2 = 12 items shown in a left-to-right increasing order; if K = 4, then all the items greater than $D_x[4]$ are too large to be D[4]; see Figure 5.3(b). Similarly, we can keep under consideration in $D_y$ just $D_y[K]$ and the items that are smaller.

IMPORTANT. Notice that D[K] is also the Kth smallest item among those kept under consideration; this is because we have discounted only elements larger than D[K].

What is the net result of this? We are now left with two sets of items, each of size K; see Figure 5.3(c). Among those items, we are looking for the Kth smallest element. In other words, once this operation has been performed, the problem we need to solve is to determine the lower median of the elements under consideration. We already know how to solve this problem efficiently.

[Figure 5.3: (a) the two locally sorted sets Dx and Dy; (b) the elements greater than Dx[K] and greater than Dy[K] are too large; (c) the elements still under consideration.]

FIGURE 5.3: All the elements greater than the local Kth smallest element can be discarded.

In other words, if K < N/2 we can reduce the problem to that of finding the lower median. Notice that this is accomplished without any communication, once it is known that we are looking for D[K].

Case K > N/2

Consider next the case when K > N/2. This means that each of the two sites has locally fewer than K elements; thus we cannot use the approach we used for K < N/2. Still, we can make a similar reduction also in this case. To see how and why, consider the following obvious but important property of any totally ordered set.


Looking for the Kth smallest is the same as looking for the (N − K + 1)th largest.

This fact has an important practical consequence. First of all, observe that if K > N/2 then N − K + 1 ≤ N/2. Further observe that the (N − K + 1)th largest item is the only one that has exactly N − K items larger than itself and exactly K − 1 items smaller than itself.

Consider $D_x$. As we are looking for the (N − K + 1)th largest data item overall, any data item smaller than $\overline{D}_x[N - K + 1]$ cannot be D[K] (as there are at least N − K + 1 larger data items). This means that we can immediately discount all these items, keeping only N − K + 1 items still under consideration. For example, in Figure 5.4(a) we have N = 24 items equidistributed between the two sites, whose items are shown in a left-to-right increasing order. If K = 21, then N − K + 1 = 4; that is, we are looking for the 4th largest item overall; then all the items smaller than the 4th largest in $D_x$, that is, smaller than $\overline{D}_x[4]$, are too small to be $D[21] = \overline{D}[4]$; see Figure 5.4(b). Similarly, we can keep under consideration in $D_y$ just $\overline{D}_y[N - K + 1]$ and the items that are larger.

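[Figure 5.4: (a) the two locally sorted sets Dx and Dy; (b) the elements smaller than the local (N − K + 1)th largest element are too small; (c) the elements still under consideration.]

FIGURE 5.4: All the data items smaller than the local (N − K + 1)th largest element can be discarded.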
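The two reductions (K < N/2 and K > N/2) can be illustrated with the following Python sketch. It assumes, as in the text, two sorted local sets of the same size N/2 with distinct values; the names `reduce_to_median` and `median_of_kept` are hypothetical, and the final merge is done centrally only to check the result, not as part of any protocol.

```python
def reduce_to_median(dx, dy, k):
    """Local, communication-free reduction of K-selection to median finding.

    Returns the items kept at each site and which median of the kept
    items ("lower" or "upper") is the sought D[K].
    """
    n_total = len(dx) + len(dy)
    half = n_total // 2
    if k < half:
        # Keep the K smallest items at each site: D[K] is the lower median.
        return dx[:k], dy[:k], "lower"
    if k > half:
        # Keep the N-K+1 largest items at each site: D[K] is the upper median.
        m = n_total - k + 1
        return dx[-m:], dy[-m:], "upper"
    return dx, dy, "lower"          # K = N/2 is already the lower median


def median_of_kept(kept_x, kept_y, which):
    """Centralized check: lower or upper median of the kept items."""
    merged = sorted(kept_x + kept_y)
    middle = len(merged) // 2
    return merged[middle - 1] if which == "lower" else merged[middle]


# N = 24 items, 12 per site, as in the examples of Figures 5.3 and 5.4.
values = [(i * 5 + 3) % 24 for i in range(24)]
dx, dy = sorted(values[:12]), sorted(values[12:])

small = reduce_to_median(dx, dy, 4)     # K = 4  < N/2
large = reduce_to_median(dx, dy, 21)    # K = 21 > N/2
```

For K = 4 each site keeps its 4 smallest items, and for K = 21 each site keeps its 4 largest items, so in both cases only 8 of the 24 items remain under consideration.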


IMPORTANT. Notice that D[K] is the (N − K + 1)th largest item among those kept under consideration; this is because we have discounted only elements smaller than D[K].

What is the net result of this? We are now left with two sets of items, each of size N − K + 1; see Figure 5.4(c). Among those items, we are looking for the (N − K + 1)th largest element. In other words, once this operation has been performed, the problem we need to solve is to determine the upper median of the elements under consideration. We already know how to solve this problem efficiently.

Summary

Regardless of the value of K, we can always transform the K-selection problem into a median-finding problem. Notice that this is accomplished without any additional communication, once it is known that we are looking for D[K].

In the description we have assumed that both sites have the same number of elements, N/2. If this is not the case, it is easy to verify (Exercise 5.6.6) that the same type of reduction can still take place.

Hacking

As we have seen, median finding is "the" core problem to solve. Our solution, Halving, is efficient. This protocol can be made more efficient by observing that we can discard any element greater than $m_x$ (because it is too large to be the median) not only in $D_x$ but also in $D_y$ (if there is any); similarly, we can discard the elements smaller than $m_y$ (because they are too small to be the median) not only from $D_y$ but also from $D_x$ (if there are any). In this way we can reduce the number of elements still under consideration by more than half, thus possibly reducing the number of iterations.

CAUTION: The number of discarded items that are greater than the median might be larger than the number of discarded items that are smaller than the median (or vice versa). This means that the overall lower median we are looking for is no longer the median of the elements left under consideration.
In other words, after removing items from consideration, we might be left with a general selection problem. By now, we know how to reduce a selection problem to a median-finding one. The resulting protocol, which we shall call GeneralHalving, will use a few more messages in each iteration but might yield a larger reduction (Exercise 5.6.7).

Generalization

This technique can be generalized to three sites; however, we are no longer able to reduce the number of items still under consideration to at most half at each iteration (Exercise 5.6.9). For n > 3, the technique we have designed for two sites is unfortunately no longer efficiently scalable. Fortunately, some lessons we have learned when dealing with the two sites are immediately and usefully applicable to any n, as we will discuss in the next section.

5.2.4 General Selection Strategy: RankSelect

In the previous sections we have seen how to perform selection when the number of data items is small or there are only two sites. In general, this is not the case. For example, in most practical applications, the number of sites is 10 to 100, while the amount of data at each site is at least $10^6$. What we need is a different strategy to deal with the general case.

Let us think of the set D containing the N elements as a search space in which we need to find $d^* = D[K]$, unknown to us; the only thing we know about $d^*$ is its rank $\mathrm{Rank}[d^*, D] = K$. An effective way to handle the problem of discovering $d^*$ is to reduce the search space as much as possible, eliminating from consideration as many items as possible, until we find $d^*$ or the search space is small enough (e.g., O(n)) for us to apply the techniques discussed in the previous sections.

Suppose that we (somehow) know the rank Rank[d, D] of a data item d in D. If Rank[d, D] = K, then d is the element we were looking for. If Rank[d, D] < K, then d is too small to be $d^*$, and so are all the items smaller than d. Similarly, if Rank[d, D] > K, then d is too large to be $d^*$, and so are all the items larger than d. This fact can be employed to design a simple and, as we will see, rather efficient selection strategy:

Strategy RankSelect:

1. Among the data items under consideration (initially, they all are), choose one, say d.
2. Determine its overall rank k = Rank[d, D].
3. If k = K, then $d = d^*$ and we are done. Else, if k < K (respectively, k > K), remove from consideration d and all the data items smaller (respectively, larger) than d, and restart the process.

Thus, according to this strategy, the selection process consists of a sequence of iterations, each reducing the search space, performed until $d^*$ is found. Notice that we could stop the process as soon as just a few data items (e.g., O(n)) are left for consideration, and then apply protocol Rank.

Most of the operations performed by this strategy are rather simple to implement.
We can assume that a spanning tree of the network is available and will be used for all communication, and that an entity is elected to coordinate the overall execution (becoming the root of the tree for this protocol). Any entity can act as coordinator and any spanning tree T of the network will do. However, for efficiency reasons, it is better to choose as coordinator the communication center s of the network, and to choose as tree T the shortest path spanning tree PT(s) of s.

Let d(i) be the item selected at the beginning of iteration i. Once d(i) is chosen, the determination of its rank requires just a broadcast (to let every entity know d(i)) started by the root s and a convergecast (to collect the partial rank information) ending at the root s; recall Exercise 2.9.43. Once s has determined the rank of d(i), s will notify all other entities of the result: $d(i) = d^*$, $d(i) < d^*$, or $d(i) > d^*$; each entity will then act accordingly (terminating or removing some elements from consideration).
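The iterations of strategy RankSelect can be simulated centrally as in the sketch below. It is not a distributed implementation: the broadcast and convergecast are replaced by a loop over the local sets, the values are assumed to be globally distinct, and the names `rank_select`, `sites`, and `choose` are hypothetical. The rule used to pick d(i) is passed as a parameter; a uniformly random pick is used as the default purely for illustration.

```python
from bisect import bisect_left, bisect_right
import random


def rank_select(sites, k, choose=random.choice):
    """Centralized simulation of strategy RankSelect.

    sites  -- list of sorted lists (the local sets), globally distinct values
    k      -- rank of the sought element, 1 <= k <= total number of items
    choose -- picks d(i) from the list of items still under consideration
    Returns the K-th smallest element and the number of iterations used.
    """
    # For each site, the slice [lo, hi) of items still under consideration.
    windows = [(0, len(s)) for s in sites]
    iterations = 0
    while True:
        iterations += 1
        candidates = [x for s, (lo, hi) in zip(sites, windows) for x in s[lo:hi]]
        d = choose(candidates)
        # "Broadcast" d, then "convergecast" the local counts of items <= d.
        rank = sum(bisect_right(s, d) for s in sites)
        if rank == k:
            return d, iterations
        if rank < k:
            # d and every smaller item are too small.
            windows = [(max(lo, bisect_right(s, d)), hi)
                       for s, (lo, hi) in zip(sites, windows)]
        else:
            # d and every larger item are too large.
            windows = [(lo, min(hi, bisect_left(s, d)))
                       for s, (lo, hi) in zip(sites, windows)]


sites = [[2, 9, 14, 30], [1, 5, 40], [7, 8, 11, 12, 25]]
median, used = rank_select(sites, 6)
```

Because each iteration removes at least d(i) itself, the loop always terminates within N iterations, whatever rule is supplied as `choose`.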


The only operation still to be discussed is how we choose d(i). The choice of d(i) is quite important because it affects the number of iterations and thus the overall complexity of the resulting protocol. Let us examine some of the possible choices and their impact.

Random Choice

We can choose d(i) uniformly at random; that is, in such a way that each item of the search space has the same probability of being chosen.

How can s choose d(i) uniformly at random? In Section 2.6.7 and Exercise 2.9.52 we have discussed how to select, in a tree, uniformly at random an item from the initial distributed set. Clearly that protocol can be used to choose d(i) in the first iteration of our algorithm. However, we cannot immediately use it in the subsequent iterations. In fact, after an iteration, some items are removed from consideration; that is, the search space is reduced. This means that, for the next iteration, we must ensure that we select an item that is still in the new search space. Fortunately, this can be achieved with simple readjustments to the protocol of Exercise 2.9.52, achieving the same cost in each iteration (Exercise 5.6.10). That is, each iteration costs at most $2(n - 1) + d_T(s, x)$ messages and $2r(s) + d_T(s, x)$ ideal time units for the random selection, plus an additional 2(n − 1) messages and 2r(s) time units to determine the rank of the selected element.

Let us call the resulting protocol RandomSelect. To determine its global cost, we need to determine the number of iterations. In the worst case, in iteration i we remove from the search space only d(i); so the number of iterations can be as bad as N, for a worst case cost of

$$M[\text{RandomSelect}] \le (4(n - 1) + r(s))\, N, \qquad (5.4)$$

$$T[\text{RandomSelect}] \le 5\, r(s)\, N. \qquad (5.5)$$

However, on the average, the power of making a random choice is evident; in fact (Exercise 5.6.11):

Lemma 5.2.1 The expected number of iterations performed by Protocol RandomSelect until termination is at most 1.387 log N + O(1).

This means that, on the average,

$$M_{\text{average}}[\text{RandomSelect}] = O(n \log N), \qquad (5.6)$$

$$T_{\text{average}}[\text{RandomSelect}] = O(n \log N). \qquad (5.7)$$

As mentioned earlier, we could stop the strategy RankSelect, and thus terminate protocol RandomSelect, as soon as O(n) data items are left for consideration, and then apply protocol Rank. See Exercise 5.6.12.


Random Choice with Reduction

We can improve the average message complexity by exploiting the properties discussed in Section 5.2.1. Let $\Delta(i) = \min\{K(i), N(i) - K(i) + 1\}$.

In fact, by Property 5.2.2, if at the beginning of iteration i an entity has more than K(i) elements under consideration, it needs to consider only the K(i) smallest and can immediately remove the others from consideration; similarly, if it has more than N(i) − K(i) + 1 items, it needs to consider only the N(i) − K(i) + 1 largest and can immediately remove the others from consideration.

If every entity does this, the search space can be further reduced even before the random selection process takes place. In fact, the net effect of the application of this technique is that each entity will have at most $\Delta(i) = \min\{K(i), N(i) - K(i) + 1\}$ items still under consideration during iteration i. The root s can then perform random selection in this reduced space of size n(i) ≤ N(i). Notice that $d^*$ will have a new rank k(i) ≤ K(i) in the new search space.

Specifically, our strategy will be to include, in the broadcast started by the root s at the beginning of iteration i, the values N(i) and K(i). Each entity, upon receiving this information, will locally perform the reduction (if any) of the local elements and then include in the convergecast the information about the size of the new search space. At the end of the convergecast, s knows both n(i) and k(i), as well as all the information necessary to perform the random selection in the reduced search space. In other words, the total number of messages per iteration will be exactly the same as that of Protocol RandomSelect.

In the worst case this change does not make any difference. In fact, for the resulting protocol RandomFlipSelect, the number of iterations can still be as bad as N (Exercise 5.6.13), for a worst case cost of

$$M[\text{RandomFlipSelect}] \le (2(n - 1) + r(s))\, N, \qquad (5.8)$$

$$T[\text{RandomFlipSelect}] \le 3\, r(s)\, N. \qquad (5.9)$$
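The local reduction performed at the beginning of an iteration can be sketched as follows. This Python fragment is a centralized illustration, not the protocol itself: it assumes sorted local sets with distinct values, and the name `reduce_search_space` and the way the new rank is derived (by counting the small items dropped at each entity) are choices made for this example.

```python
def reduce_search_space(local_sets, k):
    """Apply Property 5.2.2 at every entity.

    local_sets -- sorted lists of the items under consideration, N in total
    k          -- current rank K(i) of the sought element in their union
    Returns the reduced local sets, the new rank k(i), and the new size n(i).
    """
    n_total = sum(len(s) for s in local_sets)
    reduced, dropped_small = [], 0
    for s in local_sets:
        # Items below the local (N-K+1)-th largest are too small.
        lo = max(0, len(s) - (n_total - k + 1))
        # Items above the local K-th smallest are too large.
        hi = min(len(s), k)
        reduced.append(s[lo:hi])
        dropped_small += lo
    new_k = k - dropped_small          # only smaller items shift the rank
    new_n = sum(len(s) for s in reduced)
    return reduced, new_k, new_n


sites = [list(range(0, 20, 2)), [1, 3, 5], [7, 25]]   # N = 15
reduced, new_k, new_n = reduce_search_space(sites, 13)
```

With N = 15 and K = 13 we have min{K, N − K + 1} = 3, so the first entity keeps only its 3 largest items, and the sought element becomes the 6th smallest of the 8 items that remain.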

The change does, however, make a difference in the average cost. In fact (Exercise 5.6.14):

Lemma 5.2.2 The expected number of iterations performed by Protocol RandomFlipSelect until termination is less than $\ln(\Delta) + \ln(n) + O(1)$,

where ln() denotes the natural logarithm (recall that ln(x) ≈ 0.693 log(x)).

This means that, on the average,

$$M_{\text{average}}[\text{RandomFlipSelect}] = O(n\,(\ln(\Delta) + \ln(n))), \qquad (5.10)$$

$$T_{\text{average}}[\text{RandomFlipSelect}] = O(n\,(\ln(\Delta) + \ln(n))). \qquad (5.11)$$


Also in this case, we could stop the strategy RankSelect, and thus terminate protocol RandomFlipSelect, as soon as only O(n) data items are left for consideration, and then apply protocol Rank. See Exercise 5.6.15.

Selection in a Random Distribution

So far, we have not made any assumption on the distribution of the data items among the entities. If we know something about how the data are distributed, we can clearly exploit this knowledge to design a more efficient protocol. In this section we consider a very simple and quite reasonable assumption about how the data are distributed.

Consider the set D; it is distributed among the entities $x_1, \ldots, x_n$; let $n[x_j] = |D_{x_j}|$ be the number of items stored at $x_j$. The assumption we will make is that all the distributions of D that end up with $n[x_j]$ items at $x_j$, 1 ≤ j ≤ n, are equally likely.

In this case we can refine the selection of d(i). Let z(i) be the entity where the number of elements still under consideration in iteration i is the largest; that is, $\forall x \;\; m(i) = |D_{z(i)}(i)| \ge |D_x(i)|$. (If there is more than one entity with the same number of items, choose an arbitrary one.) In our protocol, which we shall call RandomRandomSelect, we will choose d(i) to be the h(i)th smallest item in the set $D_{z(i)}(i)$, where

$$h(i) = \left\lceil K(i)\, \frac{m(i) + 1}{N + 1} - \frac{1}{2} \right\rceil.$$

We will use this choice until there are fewer than n items under consideration. At this point, Protocol RandomRandomSelect will use Protocol RandomFlipSelect to finish the job and determine $d^*$. Notice that also in this protocol, each iteration can easily be implemented (Exercise 5.6.16) with at most 4(n − 1) + r(s) messages and 5r(s) ideal time units.

With the choice of d(i) we have made, the average number of iterations, until there are fewer than n items left under consideration, is indeed small. In fact (Exercise 5.6.17),

Lemma 5.2.3 Let the randomness assumption hold. Then the expected number of iterations performed by Protocol RandomRandomSelect until there are fewer than n items under consideration is at most $\frac{4}{3} \log \log \Delta + 1$.

This means that, on the average,

$$M_{\text{average}}[\text{RandomRandomSelect}] = O(n\,(\log \log \Delta + \log n)) \qquad (5.12)$$

and

$$T_{\text{average}}[\text{RandomRandomSelect}] = O(n\,(\log \log \Delta + \log n)). \qquad (5.13)$$

Filtering

The drawback of all previous protocols rests on their worst-case costs: O(nN) messages and O(r(s)N) time; notice that this cost is more than that of input collection, that is, of mailing all the items to s. It can be shown that the probability of the occurrence of the worst case is so small that it can be neglected. However, there might be systems where such a cost is not affordable under any circumstances. For these systems, it is necessary to have a selection protocol that, even if less efficient on the average, can guarantee a reasonable cost even in the worst case.

The design of such a protocol is fortunately not so difficult; in fact it can be achieved with the strategy RankSelect with the appropriate choice of d(i). As before, let $D_x^i$ denote the set of elements still under consideration at x in iteration i and $n_x^i = |D_x^i|$ denote its size. Consider the (lower) median $d_x^i = D_x^i[\lceil n_x^i/2 \rceil]$ of $D_x^i$, and let $M(i) = \{d_x^i\}$ be the set of these medians. With each element in M(i) associate a weight; the weight associated with $d_x^i$ is just the size $n_x^i$ of the corresponding set.

Filter: Choose d(i) to be the weighted (lower) median of M(i).

With this choice, the number of iterations is rather small (Exercise 5.6.18):

Lemma 5.2.4 The number of iterations performed by Protocol Filter until there are no more than n elements left under consideration is at most 2.41 log(N/n).

Once there are at most n elements left under consideration, the problem can be solved using one of the known techniques for small sets, for example, Rank. However, each iteration requires a complex operation; in fact we need to find the median of the set M(i) in iteration i. As the set is small (it contains at most n elements), this can be done using, for example, Protocol Rank. In the worst case, it will require $O(n^2)$ messages in each iteration. This means that, in the worst case,

$M[\text{Filter}] = O\left(n^2 \log\frac{N}{n}\right)$ (5.14)

$T[\text{Filter}] = O\left(n \log\frac{N}{n}\right)$. (5.15)
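The Filter choice can be illustrated with a minimal, centralized Python sketch. It assumes each local set is available as a sorted list and takes the weighted lower median to be the smallest median whose cumulative weight reaches at least half of the total weight; this reading of "weighted (lower) median", and the function names, are assumptions made for illustration, not the protocol's distributed implementation.

```python
def lower_median(sorted_items):
    """Lower median of a non-empty sorted list: its ceil(n/2)-th smallest item."""
    n = len(sorted_items)
    return sorted_items[(n + 1) // 2 - 1]


def weighted_lower_median(pairs):
    """pairs: list of (value, weight).

    Assumed definition: the smallest value whose cumulative weight
    (over values in increasing order) is at least half the total weight.
    """
    total = sum(weight for _, weight in pairs)
    running = 0
    for value, weight in sorted(pairs):
        running += weight
        if 2 * running >= total:
            return value
    raise ValueError("no items under consideration")


def filter_choice(local_sets):
    """Choose d(i): weighted lower median of the local lower medians."""
    pairs = [(lower_median(s), len(s)) for s in local_sets if s]
    return weighted_lower_median(pairs)


# Hypothetical example: three sites with 8, 2 and 1 items.
choice = filter_choice([[1, 2, 3, 4, 5, 6, 7, 8], [10, 20], [30]])
```

Here the large site dominates the weights, so its local median is chosen; this is what lets a single iteration discard a constant fraction of the search space.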

5.2.5 Reducing the Worst Case: ReduceSelect

The worst case we have obtained by using the Filter choice in strategy RankSelect is reasonable, but it can be reduced using a different strategy. This strategy, and the resulting protocol that we shall call ReduceSelect, is obtained mainly by combining and integrating all the techniques we have developed so far for reducing the search space with new, original ones.

Reduction Tools

Let us first summarize the main basic tools we have used so far.

Reduction Tool 1: Local Contraction If entity x has more than Δ items under consideration, it can immediately discard any item greater than the local Kth smallest element and any item smaller than the local (N − K + 1)th largest element.


This tool is based on Property 5.2.2. The requirement for the application of this tool is that each site must know K and N. The net effect of the application of this tool is that, afterwards, each site has at most Δ items under consideration stored locally. Recall that we have used this reduction tool already when dealing with the two sites case, as well as in Protocol RandomFlipSelect.

A different type of reduction is offered by the following tool.

Reduction Tool 2: Sites Reduction If the number of entities n is greater than K (respectively, N − K + 1), then n − K entities (respectively, n − N + K − 1) and all their data items can be removed from consideration. This can be achieved as follows.

1. Consider the set $D_{min} = \{D_x[1]\}$ (respectively, $D_{max} = \{D_x[|D_x|]\}$) of the smallest (respectively, the largest) item at each entity.
2. Find the Kth smallest (respectively, (N − K + 1)th largest) element, call it w, of this set. NOTE: This set has n elements; hence this operation can be performed using Protocol Rank.
3. If $D_x[1] > w$ (respectively, $D_x[|D_x|] < w$), then the entire set $D_x$ can be removed from consideration.

This reduction technique immediately reduces the number of sets involved in the problem to at most Δ. For example, consider the case of searching for the 7th largest item when the N data items of D are distributed among n = 10 entities. Consider now the largest element stored at each entity (they form a set of 10 elements), and find the 7th largest of them. The 8th largest element of this set cannot possibly be the 7th largest item of the entire distributed set D; as it is the largest item stored at the entity from which it originated, none of the other items stored at that entity can be the 7th largest element either; so we can remove from consideration the entire set stored at that entity. Similarly, we can also remove the sets where the 9th and the 10th largest came from.

These two tools can obviously be used one after the other.
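The three steps of Sites Reduction can be sketched in Python for the "Kth smallest" case. This is a centralized illustration only: it assumes distinct items, non-empty sorted local lists, and replaces the distributed selection of w (Protocol Rank in the text) with a local sort; the function name is hypothetical.

```python
def sites_reduction_smallest(local_sets, K):
    """Drop every site whose smallest item exceeds w, the K-th smallest minimum.

    local_sets: non-empty sorted lists of distinct items.
    The K-th smallest item of the union is preserved with the same rank K.
    """
    if len(local_sets) <= K:
        return local_sets              # the tool does not apply
    minima = sorted(s[0] for s in local_sets)   # step 1: the set D_min
    w = minima[K - 1]                            # step 2: K-th smallest minimum
    return [s for s in local_sets if s[0] <= w]  # step 3: remove sets with D_x[1] > w


# Hypothetical example with n = 6 sites and K = 3.
sites = [[5, 50], [1, 90], [7, 8], [3, 4], [9, 10], [2, 60]]
kept = sites_reduction_smallest(sites, 3)
```

In this example exactly K = 3 sites survive, and the 3rd smallest item of the union is the same before and after the reduction.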
The combined use of these two tools reduces the problem of selection in a search space of size N distributed among n sites to that of selection among min{n, Δ} sites, each with at most Δ elements. This means that, after the execution of these two tools, the new search space contains at most $\Delta^2$ data items.

Notice that once the tools have been applied, if the size of the search space and/or the rank of $f^*$ in that space have changed, it is possible that the two tools can be successfully applied again.

For example, consider the case depicted in Table 5.1, where N = 10,032 items are distributed among n = 5 entities, $x_1, \ldots, x_5$, and where we are looking for the Kth smallest element in this set, where K = 4096. First observe that, when we apply the two Reduction Tools, only the first one (Contraction) will be successful. The effect will be to remove from consideration many elements from $x_1$, all larger than $f^*$. In other words, we have significantly reduced the search space without changing the rank of $f^*$ in the search space. If we apply again the two Reduction Tools to the new configuration, again only the first one (Contraction) will be successful; however, this second application will further drastically reduce the size of the search space (the variable N) from 4128 (that is, 4096 + 20 + 5 + 5 + 2) to 65 and the rank of $f^*$ in the new search space (the variable K) from 4096 to 33.

TABLE 5.1: Repeated use of the Reduction Tools

N (size of search space)   K (rank of f* in search space)   x1       x2   x3   x4   x5
10,032                     4,096                            10,000   20   5    5    2
4,128                      4,096                            4,096    20   5    5    2
65                         33                               33       20   5    5    2

This fact means that we can iterate Local Contraction until there is no longer any change in the search space and in the rank of $f^*$ in the search space. This will occur when at each site $x_i$ the number of items still under consideration $n'_i$ is not greater than $\Delta' = \min\{K', N' - K' + 1\}$, where N′ is the size of the search space and K′ the rank of $f^*$ in the search space. We will then use the Sites Reduction tool. The reduction protocol REDUCE based on this repeated use of the two Reduction Tools is shown in Figure 5.5.

Lemma 5.2.5 After the execution of Protocol REDUCE, the number of items left under consideration is at most Δ min{n, Δ}.

The single execution of Sites Reduction requires selection in a small set, discussed in Section 5.2.2. Each execution of Local Contraction required by Protocol REDUCE requires a broadcast and a convergecast, and costs 2(n − 1) messages and 2r(s) time. To determine the total cost we need to find out the number of times Local Contraction is executed. Interestingly, this will occur a constant number of times, three times to be precise (Exercise 5.6.19).

REDUCE
begin
   N′ := N; K′ := K; Δ′ := Δ; n′_i := n_i, 1 ≤ i ≤ n;
   while ∃ x_i such that n′_i > Δ′ do
      perform Local Contraction;
      * update the values of N′, K′, Δ′, n′_i (1 ≤ i ≤ n)
   endwhile
   if n > Δ′ then
      perform Sites Reduction;
   endif
end

FIGURE 5.5: Protocol REDUCE.
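The while loop of REDUCE can be traced with a small centralized Python sketch. It assumes a one-sided variant of Local Contraction in which each round cuts only on the side that attains Δ′ = min{K′, N′ − K′ + 1}; this variant is an assumption chosen so that the intermediate configurations are visible, and the data below are hypothetical, with site sizes patterned on Table 5.1.

```python
def local_contraction(sets, K):
    """One round of (one-sided) Local Contraction on sorted local lists.

    Returns the contracted lists and the new rank of the sought item.
    """
    N = sum(len(s) for s in sets)
    delta = min(K, N - K + 1)
    contracted, removed_below = [], 0
    for s in sets:
        if len(s) > delta:
            if delta == K:
                # Items past the local K-th smallest have rank > K: drop them.
                s = s[:delta]
            else:
                # Items before the local delta-th largest have rank < K:
                # drop them and remember how many, since the rank shifts.
                removed_below += len(s) - delta
                s = s[-delta:]
        contracted.append(s)
    return contracted, K - removed_below


def reduce_by_contraction(sets, K):
    """Iterate Local Contraction until every site holds at most delta items."""
    history = []
    while True:
        N = sum(len(s) for s in sets)
        delta = min(K, N - K + 1)
        history.append((N, K, [len(s) for s in sets]))
        if all(len(s) <= delta for s in sets):
            return sets, K, history
        sets, K = local_contraction(sets, K)


# Hypothetical data with site sizes 10000, 20, 5, 5, 2.
small_sites = [list(range(100, 2001, 100)), list(range(3001, 3006)),
               list(range(7001, 7006)), [9001, 9002]]
taken = {d for s in small_sites for d in s}
x1 = [d for d in range(10032) if d not in taken]
final_sets, final_K, history = reduce_by_contraction([x1] + small_sites, 4096)
```

Each entry of `history` records (N′, K′, site sizes) before a round, so the list can be compared row by row with the table; the sought item keeps its identity throughout, only its rank in the shrinking search space changes.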


Cutting Tools

The new tool we are going to develop is to be used whenever the number n of sets is at most Δ and each entity has at most Δ items; this is, for example, the result of applying Tools 1 and 2 described before. Thus, the search space contains at most $\Delta^2$ items. For simplicity, and without loss of generality, let K = Δ (the case N − K + 1 = Δ is analogous).

To aid in the design, we can visualize the search space as an array D of size n × Δ, where the rows correspond to the sets of items, each set sorted in increasing order, and the columns specify the rank of that element in the set. So, for example, $d_{i,j}$ is the jth smallest item in the set stored at entity $x_i$. Notice that there is no relationship among the elements of the same column; in other words, D is a matrix with sorted rows but unsorted columns. Each column corresponds to a set of n elements distributed among the n entities. If an element is removed from consideration, it will be represented by +∞ in the corresponding entry in the array.

Consider the set C(2), that is, all the second-smallest items in each site. Focus on the kth smallest element m(2) of this set, where $k = \lceil K/2 \rceil$. By definition, m(2) has exactly k − 1 elements smaller than itself in C(2); each of them, as well as m(2), has another item smaller than itself in its own row (this is because they are second-smallest in their own set). This means that, as far as we know, m(2) has at least (k − 1) + k = 2k − 1 ≥ K − 1 items smaller than itself in the global set D; this implies that any item greater than m(2) cannot be the Kth smallest item we are looking for. In other words, if we find m(2), then we can remove from consideration any item larger than m(2).

Similarly, we can consider the set $C(2^i)$, where $2^i \le K$, composed of the $2^i$th smallest items in each set. Focus again on the kth smallest element $m(2^i)$ of $C(2^i)$, where $k = \lceil K/2^i \rceil$. By definition, $m(2^i)$ has exactly k − 1 elements smaller than itself in $C(2^i)$; each of them, as well as $m(2^i)$, has another $2^i - 1$ items smaller than itself in its own row (this is because they are the $2^i$th smallest in their own set). This means that $m(2^i)$ has at least

$(k - 1) + k(2^i - 1) = k\,2^i - 1 \ge \frac{K}{2^i}\,2^i - 1 = K - 1$

items smaller than itself in the global set D; this implies that any item greater than $m(2^i)$ cannot be the Kth smallest item we are looking for. In other words, if we find $m(2^i)$, then we can remove from consideration any item larger than $m(2^i)$. Thus, we have a generic Reduction Tool using columns whose index is a power of two.
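The counting argument above can be checked on a small instance with the following Python sketch of a single cut on column C(l). It is a centralized illustration under stated assumptions: rows are sorted lists of distinct items, removed items are deleted instead of being replaced by +∞ (only suffixes of rows are removed, so column indices of the surviving items do not change), and the function name is hypothetical.

```python
def cutting_tool(rows, K, l):
    """Cut on column C(l): remove every item greater than m(l).

    rows: sorted lists (the rows of the array D); l: column index, l <= K.
    m(l) is the k-th smallest item of C(l), with k = ceil(K / l).
    """
    k = -(-K // l)                                   # integer ceiling of K / l
    column = sorted(row[l - 1] for row in rows if len(row) >= l)
    if len(column) < k:
        return rows                                  # too few items in C(l)
    m = column[k - 1]
    return [[d for d in row if d <= m] for row in rows]


# Hypothetical 4 x 4 search space; the 4th smallest item overall is 4.
rows = [[1, 4, 9, 16], [2, 3, 5, 20], [6, 7, 8, 30], [10, 11, 12, 13]]
after_c2 = cutting_tool(rows, K=4, l=2)
```

With K = 4 and l = 2 we get k = 2 and m(2) = 4, so the cut leaves only four items, among which the sought item still has rank 4.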


CUT
begin
   k := ⌈K/2⌉; l := 2;
   while k ≥ log K and search space is not small do
      if in C(l) there are ≥ k items still under consideration then
         * use the Cutting Tool:
         find the kth smallest element m(l) of C(l);
         remove from consideration all the elements greater than m(l).
      endif
      k := ⌈k/2⌉; l := 2l;
   endwhile
end

FIGURE 5.6: Protocol CUT.

Cutting Tool Let $l = 2^i \le K$ and $k = \lceil K/l \rceil$. Find the kth smallest element m(l) of C(l), and remove from consideration all the elements greater than m(l).

The Cutting Tool can be implemented using any protocol for selection in small sets (recall that each C(l) has at most n elements), such as Rank; a single broadcast will notify all entities of the outcome and allow each to reduce its own set if needed.

On the basis of this tool we can construct a reduction protocol that sequentially uses the Cutting Tool, first using C(2), then C(4), then C(8), and so on. Clearly, if at any time the search space becomes small (i.e., O(n)), we terminate. This reduction algorithm, which we will call CUT, is shown in Figure 5.6.

Let us examine the reduction power of Protocol CUT. After executing the Cutting Tool on C(2), only one column, C(1), might remain unchanged; all others, including C(2), will have at least half of the entries +∞. In general, after the execution of the Cutting Tool on $C(l = 2^i)$, only the l − 1 columns C(1), C(2), . . . , C(l − 1) might remain unchanged; all others, including C(l), will have at least n − K/l of the entries +∞ (Exercise 5.6.20). This can be used to show (Exercise 5.6.21) that

Lemma 5.2.6 After the execution of Protocol CUT, the number of items left under consideration is at most min{n, Δ} log Δ.

Each of the log Δ executions of the Cutting Tool performed by Protocol CUT requires a selection in a set of size at most min{n, Δ}. This can be performed using any of the protocols for selection in a small set, for example, Protocol Rank. In the worst case, it will require $O(n^2)$ messages in each iteration. This means that, in the worst case,

$M[\text{CUT}] = O(n^2 \log\Delta)$, (5.16)

$T[\text{CUT}] = O(n \log\Delta)$. (5.17)
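The loop of Protocol CUT can be sketched in Python as follows. This is a centralized illustration, not the distributed protocol: rows are sorted lists of distinct items, logarithms are taken base 2, `small` is a caller-supplied threshold standing in for "the search space is small", and the extra guard `l <= K` (taken from the Cutting Tool's requirement l ≤ K) is an assumption added so that the loop also stops for very small K.

```python
from math import log2


def cut(rows, K, small):
    """Apply the Cutting Tool to columns C(2), C(4), C(8), ... in sequence."""
    k, l = -(-K // 2), 2                 # k = ceil(K / 2)
    while l <= K and k >= log2(K) and sum(len(r) for r in rows) > small:
        column = sorted(row[l - 1] for row in rows if len(row) >= l)
        if len(column) >= k:
            m = column[k - 1]            # k-th smallest element of C(l)
            rows = [[d for d in row if d <= m] for row in rows]
        k, l = -(-k // 2), 2 * l         # k = ceil(k / 2), next power of two
    return rows


# Hypothetical instance: n = 4 sites, 16 items each, K = delta = 16.
grid = [[i + 4 * j for j in range(16)] for i in range(4)]
reduced = cut(grid, K=16, small=4)
```

On this instance the first cut (on C(2)) does not apply because the column has fewer than k = 8 items, the cut on C(4) leaves 16 items, and the loop then stops since k drops below log K; the result meets the bound min{n, Δ} log Δ = 16 of Lemma 5.2.6.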


ReduceSelect
begin
   REDUCE;
   if search space greater than O(Δ′) then CUT;
   if search space greater than O(n) then Filter;
   Rank;
end

FIGURE 5.7: Protocol ReduceSelect.

Putting It All Together

We have examined a set of Reduction Tools. Summarizing, Protocol REDUCE, composed of the application of Reduction Tools 1 and 2, reduces the search space from N to at most $\Delta^2$. Protocol CUT, composed of a sequence of applications of the Cutting Tool, reduces the search space from $\Delta^2$ to at most min{n, Δ} log Δ. Starting from these reductions, to form a full selection protocol, we will first reduce the search space from min{n, Δ} log Δ to O(n) (e.g., using Protocol Filter) and then use a protocol for small sets (e.g., Rank) to determine the sought item. In other words, the resulting algorithm, Protocol ReduceSelect, will be as shown in Figure 5.7, where Δ′ is the new value of Δ after the execution of REDUCE.

Let us examine the cost of Protocol ReduceSelect. Protocol REDUCE, as we have seen, requires at most 3 iterations of Local Contraction, each using 2(n − 1) messages and 2r(s) time, and one execution of Sites Reduction, which consists of an execution of Rank. Protocol CUT is used with N ≤ min{n, Δ}Δ and thus, as we have seen, requires at most log Δ iterations of the Cutting Tool, each consisting of an execution of Rank. Protocol Filter is used with N ≤ min{n, Δ} log Δ and thus, as we have seen, requires at most log log Δ iterations, each costing 2(n − 1) messages and 2r(s) time plus an execution of Rank. Thus, in total, we have

$M[\text{ReduceSelect}] = (\log\Delta + 4.5\log\log\Delta + 2)\,M[\text{Rank}] + (6 + 4.5\log\log\Delta)(n - 1)$, (5.18)

$T[\text{ReduceSelect}] = (\log\Delta + 4.5\log\log\Delta + 2)\,T[\text{Rank}] + (6 + 4.5\log\log\Delta)\,2r(s)$. (5.19)
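To get a feel for the size of the bound (5.18), the following Python sketch simply evaluates its right-hand side. It assumes base-2 logarithms and Δ > 1, and it treats the cost of one execution of Rank as a caller-supplied number `m_rank`; the function name is hypothetical.

```python
from math import log2


def reduce_select_messages(n, delta, m_rank):
    """Right-hand side of (5.18) for n entities, a given delta and M[Rank]."""
    loglog = log2(log2(delta))
    return ((log2(delta) + 4.5 * loglog + 2) * m_rank
            + (6 + 4.5 * loglog) * (n - 1))


# Hypothetical values: n = 11 entities, delta = 16, M[Rank] = 100 messages.
bound = reduce_select_messages(11, 16, 100)
```

For Δ = 16 the two logarithmic terms are 4 and 2, so the expression reduces to 15 M[Rank] + 15 (n − 1); note that N does not appear in the bound at all.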

5.3 SORTING A DISTRIBUTED SET

5.3.1 Distributed Sorting

Sorting is perhaps the most well known and investigated algorithmic problem. In distributed computing systems, the setting where this problem takes place, as well as its nature, is very different from the serial and parallel ones. In particular, in our setting, sorting must take place in networks of computing entities where no central controller is present and no common clock is available. Not surprisingly, most of the best serial and parallel sorting algorithms do very poorly when applied to a distributed environment. In this section we will examine the problem, its nature, and its solutions.

FIGURE 5.8: Distribution sorted according to (a) π = 3124 and (b) π = 2431. In (a), $x_3$ = {11, 22, 30, 34, 45}, $x_1$ = {56, 57}, $x_2$ = {68, 69, 71, 75}, $x_4$ = {82, 85, 87}; in (b), $x_2$ = {11, 22, 30, 34}, $x_4$ = {45, 56, 57}, $x_3$ = {68, 69, 71, 75, 82}, $x_1$ = {85, 87}.

Let us start with a clear specification of the task and its requirements. As before in this chapter, we have a distribution $D_{x_1}, \ldots, D_{x_n}$ of a set D among the entities $x_1, \ldots, x_n$ of a system with communication topology G, where $D_{x_i}$ is the set of items stored at $x_i$. Each entity $x_i$, because of the Distinct Identifiers assumption ID, has a unique identity id(i), from a totally ordered set. For simplicity, in the following we will assume that the ids are the numbers 1, 2, . . . , n and that id(i) = i, and we will denote $D_{x_i}$ simply by $D_i$.

Let us now focus on the definition of a sorted distribution. A distribution is (quite reasonably) considered sorted if, whenever i < j, all the data items stored at $x_i$ are smaller than the items stored at $x_j$; this condition is usually called increasing order. A distribution is also considered sorted if all the smallest items are in $x_n$, the next ones in $x_{n-1}$, and so on, with the largest ones in $x_1$; usually, we call this condition decreasing order.

Let us be precise. Let π be a permutation of the indices {1, . . . , n}. A distribution $D_1, \ldots, D_n$ is sorted according to π if and only if the following Sorting Condition holds:

$\pi(i) < \pi(j) \;\Rightarrow\; \forall d' \in D_i,\ d'' \in D_j:\ d' < d''$. (5.20)

In other words, if the distribution is sorted according to π, then all the smallest items must be in $x_{\pi(1)}$, the next smallest ones in $x_{\pi(2)}$, and so on, with the largest ones in $x_{\pi(n)}$. So the requirement that the data are sorted according to the increasing order of the ids of the entities is given by the permutation π = 1 2 . . . n. The requirement of being sorted in a decreasing order is given by the permutation π = n (n − 1) . . . 1. For example, in Figure 5.8(b), the set is sorted according to the permutation π = 2 4 3 1; in fact, all the smallest data items are stored at $x_2$, the next ones in $x_4$, the yet larger ones in $x_3$, and all the largest data items are stored at $x_1$. We are now ready to define the problem of sorting a distributed set.
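As a small illustration of what "sorted according to π" means, the Python sketch below checks a distribution against a permutation. It reads π as the sequence π(1), . . . , π(n) of entity indices, so that the entity listed first must hold the smallest items, as in the Figure 5.8(b) example; the function name and the dictionary representation are assumptions made for illustration, and the data are those read from the caption and layout of Figure 5.8.

```python
def is_sorted_according_to(dist, pi):
    """dist maps an entity index to its set of items; pi lists the entity
    indices from the holder of the smallest items to the holder of the largest.
    """
    ordered = [dist[i] for i in pi if dist[i]]       # skip empty sets
    # Comparing neighbours is enough: the order on items is transitive.
    return all(max(a) < min(b) for a, b in zip(ordered, ordered[1:]))


# Data read from Figure 5.8 (a) and (b).
fig_a = {3: {11, 22, 30, 34, 45}, 1: {56, 57}, 2: {68, 69, 71, 75}, 4: {82, 85, 87}}
fig_b = {2: {11, 22, 30, 34}, 4: {45, 56, 57}, 3: {68, 69, 71, 75, 82}, 1: {85, 87}}
ok_a = is_sorted_according_to(fig_a, (3, 1, 2, 4))
ok_b = is_sorted_according_to(fig_b, (2, 4, 3, 1))
```

The same distribution is of course not sorted according to a different permutation, for example the increasing order 1 2 3 4.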


Sorting Problem Given a distribution $D_1, \ldots, D_n$ of D and a permutation π, the distributed sorting problem is the one of moving data items among the entities so that, upon termination,

1. $D'_1, \ldots, D'_n$ is a distribution of D, where $D'_i$ is the final set of data at $x_i$;
2. $D'_1, \ldots, D'_n$ is sorted according to π.

Note that the definition does not say anything about the relationship between the sizes of the initial sets $D_i$ and those of the final sets $D'_i$. Depending on which requirement we impose, we have different versions of the problem. There are three fundamental requirements:

invariant-sized sorting: $|D'_i| = |D_i|$, 1 ≤ i ≤ n, that is, each entity ends up with the same number of items it started with.

equidistributed sorting: $|D'_{\pi(i)}| = \lceil N/n \rceil$ for 1 ≤ i < n and $|D'_{\pi(n)}| = N - (n - 1)\lceil N/n \rceil$, that is, every entity receives the same amount of data, except for $x_{\pi(n)}$ that might receive fewer items.

compacted sorting: $|D'_{\pi(i)}| = \min\{w, N - (i - 1)w\}$, where $w \ge \lceil N/n \rceil$ is the storage capacity of the entities, that is, each entity, starting from $x_{\pi(1)}$, receives as many unassigned items as it can store.

Notice that equidistributed sorting is a compacted sorting with $w = \lceil N/n \rceil$. For some of the algorithms we will discuss, it does not really matter which requirement is used; for some protocols, however, the choice of the requirement is important. In the following, unless otherwise specified, we will use the invariant-sized requirement.

From the definition, it follows that when sorting a distributed set the relevant factors are the permutation according to which we sort, the topology of the network in which we sort, the location of the entities in the network, as well as the storage requirements. In the following two sections, we will examine some special cases that will help us understand these factors, their interplay, and their impact.
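The three storage requirements only differ in the final sizes they prescribe. A minimal Python sketch of these sizes, listed in the order $x_{\pi(1)}, \ldots, x_{\pi(n)}$, is given below. It assumes that N/n is rounded up and that an entity receives nothing once all items have been assigned (so sizes never go negative); both points, and the function names, are assumptions made for illustration.

```python
def invariant_sizes(initial_sizes):
    """Invariant-sized sorting: every entity keeps its initial number of items."""
    return list(initial_sizes)


def compacted_sizes(N, n, w):
    """Compacted sorting: fill entities in order, each up to capacity w."""
    sizes, unassigned = [], N
    for _ in range(n):
        take = min(w, unassigned)
        sizes.append(take)
        unassigned -= take
    return sizes


def equidistributed_sizes(N, n):
    """Equidistributed sorting: compacted sorting with w = ceil(N / n)."""
    return compacted_sizes(N, n, -(-N // n))


# Hypothetical example: N = 17 items among n = 5 entities.
equi = equidistributed_sizes(17, 5)
compact = compacted_sizes(17, 5, 6)
```

With N = 17 and n = 5 the equidistributed sizes are four full entities and one short one, while a capacity of w = 6 leaves the last two entities empty.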
5.3.2 Special Case: Sorting on an Ordered Line

Consider the case when we want to sort the data according to a permutation π, and the network G is a line where $x_{\pi(i)}$ is connected to $x_{\pi(i+1)}$, 1 ≤ i < n. This case is very special. In fact, the entities are located on the line in such a way that their indices are ordered according to the permutation π. (The data, however, are not sorted.) For this reason, G is also called an ordered line. As an example, see Figure 5.9, where π = 1, 2, . . . , n.

A simple sorting technique for an ordered line is OddEven-LineSort, based on the parallel algorithm odd-even-transposition sort, which is in turn based on the well-known serial algorithm Bubble Sort. This technique is composed of a sequence of iterations, where initially j = 0.


FIGURE 5.9: A distribution on an ordered line of size n = 5: $x_1$ = {1, 9, 13, 18}, $x_2$ = {3, 6, 8, 20}, $x_3$ = {2, 7, 12}, $x_4$ = {10, 15, 16}, $x_5$ = {5, 11, 14}.

Technique OddEven-LineSort:

1. In iteration 2j + 1 (an odd iteration), entity $x_{2i+1}$ exchanges its data with neighbour $x_{2i+2}$, $0 \le i \le \lfloor n/2 \rfloor - 1$; as a result, $x_{2i+1}$ retains the smallest items while $x_{2i+2}$ retains the largest ones.
2. In iteration 2j (an even iteration), entity $x_{2i}$ exchanges its data with neighbour $x_{2i+1}$, $1 \le i \le \lceil n/2 \rceil - 1$; as a result, $x_{2i}$ retains the smallest items while $x_{2i+1}$ retains the largest ones.
3. If no data items change place at all during an iteration (other than the first), then the process stops.

A schematic representation of the operations performed by the technique OddEven-LineSort is by means of the "sorting diagram": a synchronous TED (time-event diagram) where the exchange of data between two neighboring entities is shown as a bold line connecting the time lines of the two entities. The sorting diagram for a line of n = 5 entities is shown in Figure 5.10. The alternation of "odd" and "even" steps is clearly visible in the diagram.

To obtain a fully specified protocol, we still need to explain two important operations: termination and data exchange.

Termination. We have said that we terminate when no data items change place at all during an iteration. This situation can be easily determined. In fact, at the end of an iteration, each entity x can set a Boolean variable change to true or false to indicate whether or not its data set has changed during that iteration. Then, we can check (by computing the AND of those variables) if no data items have changed place at all during that iteration; if this is the case for every entity, we terminate, else we start the next iteration.

FIGURE 5.10: Diagram of operations of OddEven-LineSort in a line of size n = 5.


Data Exchange. At the basis of the technique there is the exchange of data between two neighbors, say x and y; at the end of this exchange, which we will call merge, x will have the smallest items and y the largest ones (or vice versa). This specification is, however, not quite precise. Assume that, before the merge, x has p items while y has q items, where possibly p ≠ q; how much data should x and y retain after the merge? The answer depends, partially, on the storage requirements.

If we are to perform an invariant-sized sorting, x should retain p items and y should retain q items.
If we are to perform a compacted sorting, x should retain min{w, p + q} items and y retain the others.
If we are to perform an equidistributed sorting, x should retain $\min\{\lceil N/n \rceil, p + q\}$ items and y retain the others. Notice that, in this case, each entity needs to know both n and N.

The results of the execution of OddEven-LineSort with invariant-sized sorting on the ordered line of Figure 5.9 are shown in Table 5.2.

The correctness of the protocol, although intuitive, is not immediate (Exercises 5.6.23, 5.6.24, 5.6.25, and 5.6.26). In particular, the so-called "0-1 principle" (employed to prove the correctness of the similar parallel algorithm) cannot be used directly in our case. This is due to the fact that the local data sets $D_i$ may contain several items, and may have different sizes.

Cost

The time cost is clearly determined by the number of iterations. In the worst case, the data items are initially sorted the "wrong" way; that is, the initial distribution is sorted according to permutation π′ = π(n), π(n − 1), . . . , π(1). Consider the largest item; it has to move from $x_1$ to $x_n$; as it can only move by one location per iteration, to complete its move it requires n − 1 iterations. Indeed, this is the actual cost for some initial distributions (Exercise 5.6.27).

Property 5.3.1 OddEven-LineSort sorts an equidistributed distribution in n − 1 iterations if the required sorting is (a) invariant-sized, or (b) equidistributed, or (c) compacted.

TABLE 5.2: Execution of OddEven-LineSort on the System of Figure 5.9

iteration   x1               x2                 x3                x4                x5
1           {1,9,13,18} →    ← {3,6,8,20}       {2,7,12} →        ← {10,15,16}      {5,11,14}
2           {1,3,6,8}        {9,13,18,20} →     ← {2,7,10}        {12,15,16} →      ← {5,11,14}
3           {1,3,6,8} →      ← {2,7,9,10}       {13,18,20} →      ← {5,11,12}       {14,15,16}
4           {1,2,3,6}        {7,8,9,10} →       ← {5,11,12}       {13,18,20} →      ← {14,15,16}
5           {1,2,3,6} →      ← {5,7,8,9}        {10,11,12} →      ← {13,14,15}      {16,18,20}
6           {1,2,3,5}        {6,7,8,9} →        ← {10,11,12}      {13,14,15} →      ← {16,18,20}
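The execution on the system of Figure 5.9 can be replayed with a short, centralized Python simulation of OddEven-LineSort for invariant-sized sorting. It is only a sketch: the exchanges of an iteration are performed one after the other instead of in parallel (they involve disjoint pairs, so the result is the same), termination is detected by a local flag in place of the distributed AND, and the function name is hypothetical.

```python
def odd_even_line_sort(line):
    """Simulate invariant-sized OddEven-LineSort on an ordered line.

    line[i] holds the data of entity x_{i+1}.
    Returns the final distribution and the number of iterations performed.
    """
    data = [sorted(s) for s in line]
    iteration = 0
    while True:
        iteration += 1
        # Odd iterations pair (x1,x2), (x3,x4), ...; even ones (x2,x3), (x4,x5), ...
        first = 0 if iteration % 2 == 1 else 1
        changed = False
        for a in range(first, len(data) - 1, 2):
            pooled = sorted(data[a] + data[a + 1])
            p = len(data[a])                 # the left entity keeps p items
            left, right = pooled[:p], pooled[p:]
            if left != data[a]:
                changed = True
            data[a], data[a + 1] = left, right
        if not changed and iteration > 1:
            return data, iteration


# The distribution of Figure 5.9.
figure_5_9 = [[1, 9, 13, 18], [3, 6, 8, 20], [2, 7, 12], [10, 15, 16], [5, 11, 14]]
final, iterations = odd_even_line_sort(figure_5_9)
```

On this input the simulation performs six iterations, the last of which moves no data and therefore triggers termination.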


Interestingly, the number of iterations can actually be much more than n − 1 if the initial distribution is not equidistributed. Consider, for example, an invariant-sized sorting when the initial distribution is sorted according to permutation π′ = π(n), π(n − 1), . . . , π(1). Assume that $x_1$ and $x_n$ each have kq items, while $x_2$ has only q items. All the items initially stored in $x_1$ must end up in $x_n$; however, in the first iteration only q items will move from $x_1$ to $x_2$; because of the "odd-even" alternation, the next q items will leave $x_1$ in the 3rd iteration, the next q in the 5th, and so on. Hence, the total number of iterations required for all data to move from $x_1$ to $x_n$ is at least n − 1 + 2(k − 1). This implies that, in the worst case, the time costs can be considerably high (Exercise 5.6.28):

Property 5.3.2 OddEven-LineSort performs an invariant-sized sorting in at most N − 1 iterations. This number of iterations is achievable.

Assuming (quite unrealistically) that the entire data set of an entity can be sent in one time unit to its neighbor, the time required by all the merge operations is exactly the same as the number of iterations. In contrast to this, to determine termination, we need to compute the AND of the Boolean variables change at each iteration. This operation can be done on a line in time n − 1 at each iteration. Thus, in the worst case,

$T[\text{OddEven-LineSort}_{\text{invariant}}] = O(nN)$. (5.21)

Similarly bad time costs can be derived for equidistributed sorting and compacted sorting.

Let us focus now on the number of messages for invariant-sized sorting. If we do not impose any size constraints on the initial distribution then, by Property 5.3.2, the number of iterations can be as bad as N − 1; as in each iteration we perform the computation of the function AND, and this requires 2(n − 1) messages, it follows that the protocol will use 2(n − 1)(N − 1) messages just for computing the AND. To this cost we still need to add the number of messages used for the transfer of data items. Hence, without storage constraints on the initial distribution, the protocol has a very high cost due to the high number of iterations possible.

Let us consider now the case when the initial distribution is equidistributed. By Property 5.3.1, the number of iterations is at most n − 1 (instead of N − 1). This means that the cost of computing the AND is $O(n^2)$ (instead of O(Nn)). Surprisingly, even in this case, the total number of messages can be very high.

Property 5.3.3 OddEven-LineSort can use O(Nn) messages to perform an invariant-sized sorting. This cost is achievable even if the data are initially equidistributed.


To see why this is the case, consider an initial equidistribution sorted according to permutation π′ = π(n), π(n − 1), . . . , π(1). In this case, every data item will change location in each iteration (Exercise 5.6.29), that is, O(N) messages will be sent in each iteration. As there can be n − 1 iterations with an initial equidistribution (by Property 5.3.1), we obtain the bound. Summarizing:

$M[\text{OddEven-LineSort}_{\text{invariant}}] = O(nN)$. (5.22)

That is, using Protocol OddEven-LineSort can cost as much as broadcasting all the data to every entity. This result holds even if the data are initially equidistributed. Similarly bad message costs can be derived for equidistributed sorting and compacted sorting. Summarizing, Protocol OddEven-LineSort does not appear to be very efficient.

IMPORTANT. Each line network is ordered according to a permutation. However, this permutation might not be π, according to which we need to sort the data. What happens in this case? The protocol OddEven-LineSort does not work if the entities are not positioned on the line according to π, that is, when the line is not ordered according to π (Exercise 5.6.30). The question then becomes how to sort a set distributed on an unsorted line. We will leave this question open until later in this chapter.

5.3.3 Removing the Topological Constraints: Complete Graph

One of the problems we have faced in the line graph is the constraint that the topology of the network imposes. Indeed, the line graph is one of the worst topologies for a tree, as its diameter is n − 1. In this section we will do the opposite: We will consider the complete graph, where every entity is directly connected to every other entity; in this way, we will be able to remove the constraints imposed by the network topology. Without loss of generality (since we are in a complete network), we assume π = 1, 2, . . . , n.

As the complete graph contains every graph as a subgraph, we can choose to operate on whichever graph best suits our computational needs. Thus, for example, we can choose an ordered line and use protocol OddEven-LineSort we discussed before. However, as we have seen, this protocol is not very efficient.

If we are in a complete graph, we can adapt and use some of the well-known techniques for serial sorting. Let us focus on the classical Merge-Sort strategy.
This strategy, in our distributed setting, becomes as follows: (1) the distribution to be sorted is first divided into two partial distributions of equal size; (2) each of these two partial distributions is independently sorted recursively using MergeSort; and (3) then the two sorted partial distributions are merged to form a sorted distribution.

The problem with this strategy is that the last step, the merging step, is not an obvious one in a distributed setting; in fact, after the first iteration, the two sorted distributions to be merged are scattered among many entities. Hence the question: How do we efficiently "merge" two sorted distributions of several sets to form a sorted distribution? There are many possible answers, each yielding a different merge-sort protocol. In the following we discuss a protocol for performing distributed merging by means of the odd-even strategy we discussed for the ordered line.

Let us first introduce some terminology. We are given a distribution $D = D_1, \ldots, D_n$. Consider now a subset $\{D_{j_1}, \ldots, D_{j_q}\}$ of the data sets, where $j_i < j_{i+1}$ (1 ≤ i < q). The corresponding distribution $D' = D_{j_1}, \ldots, D_{j_q}$ is called a partial distribution of D. We say that the partial distribution D′ is sorted (according to π = 1, . . . , n) if all the items in $D_{j_i}$ are smaller than the items in $D_{j_{i+1}}$, 1 ≤ i < q. Note that it might happen that D′ is sorted while D is not.

Let us now describe how to odd-even-merge a sorted partial distribution $A_1, \ldots, A_{p/2}$ with a sorted partial distribution $A_{p/2+1}, \ldots, A_p$ to form a sorted distribution $A_1, \ldots, A_p$, where we are assuming for simplicity that p is a power of 2.

OddEven-Merge Technique:

1. If p = 2, then there are two sets $A_1$ and $A_2$, held by entities $y_1$ and $y_2$, respectively. To odd-even-merge them, each of $y_1$ and $y_2$ sends its data to the other entity; $y_1$ retains the smallest while $y_2$ retains the largest items. We call this basic operation simply merge.
2. If p > 2, then the odd-even-merge is performed as follows:
   (a) first recursively odd-even-merge the distribution $A_1, A_3, A_5, \ldots, A_{p/2-1}$ with the distribution $A_{p/2+1}, A_{p/2+3}, A_{p/2+5}, \ldots, A_{p-1}$;
   (b) then recursively odd-even-merge the distribution $A_2, A_4, A_6, \ldots, A_{p/2}$ with the distribution $A_{p/2+2}, A_{p/2+4}, A_{p/2+6}, \ldots, A_p$;
   (c) finally, merge $A_{2i}$ with $A_{2i+1}$ ($1 \le i \le \frac{p}{2} - 1$).

The technique OddEven-Merge can then be used to generate the OddEven-MergeSort technique for sorting a distribution $\langle D_1, \ldots, D_n \rangle$. As in the classical case, the technique is defined recursively as follows:

OddEven-MergeSort Technique:
1. recursively odd-even-merge-sort the distribution $\langle D_1, \ldots, D_{n/2} \rangle$;
2. recursively odd-even-merge-sort the distribution $\langle D_{n/2+1}, \ldots, D_n \rangle$;
3. odd-even-merge $\langle D_1, \ldots, D_{n/2} \rangle$ with $\langle D_{n/2+1}, \ldots, D_n \rangle$.

Using this technique, we obtain a protocol for sorting a distribution $\langle D_1, \ldots, D_n \rangle$; we shall call this protocol like the technique itself: Protocol OddEven-MergeSort. To determine the communication costs of this protocol we need to “unravel” the recursion.
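The two recursive techniques can be sketched as a centralized simulation. This is only an illustration under stated assumptions, not the protocol itself: the function names are hypothetical, the distribution is a Python list indexed from 0, $n$ is a power of 2, and the simple merge keeps the local sizes unchanged (an invariant-sized requirement).

```python
def merge(dist, a, b):
    """Simple merge between entities a and b (a precedes b in the ordering).

    Both exchange their data; a keeps the smallest len(dist[a]) items and
    b keeps the rest, so local sizes do not change (assumption of this sketch).
    """
    pooled = sorted(dist[a] + dist[b])
    k = len(dist[a])
    dist[a], dist[b] = pooled[:k], pooled[k:]


def odd_even_merge(dist, idx):
    """Odd-even-merge the two sorted halves of the entities listed in idx."""
    p = len(idx)
    if p == 2:
        merge(dist, idx[0], idx[1])
        return
    first, second = idx[:p // 2], idx[p // 2:]
    # (a) odd-numbered sets A1, A3, ... of both halves
    odd_even_merge(dist, first[0::2] + second[0::2])
    # (b) even-numbered sets A2, A4, ... of both halves
    odd_even_merge(dist, first[1::2] + second[1::2])
    # (c) merge A_{2i} with A_{2i+1}, 1 <= i <= p/2 - 1 (0-based: idx[1], idx[2], ...)
    for i in range(1, p - 1, 2):
        merge(dist, idx[i], idx[i + 1])


def odd_even_merge_sort(dist, idx=None):
    """Sort the distribution in place; len(dist) is assumed to be a power of 2."""
    if idx is None:
        idx = list(range(len(dist)))
    if len(idx) < 2:
        return
    half = len(idx) // 2
    odd_even_merge_sort(dist, idx[:half])
    odd_even_merge_sort(dist, idx[half:])
    odd_even_merge(dist, idx)


example = [[8, 3], [5, 1], [7, 2], [6, 4]]   # equidistributed, n = 4
odd_even_merge_sort(example)
print(example)
```

On this equidistributed input the result is a sorted distribution; the sketch makes no claim for inputs with unequal local sizes.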


FIGURE 5.11: Diagram of operations of OddEven-MergeSort with n = 8 (entities $x_1, \ldots, x_8$).

When we do this, we realize that the protocol is a sequence of $1 + \log n$ iterations (Exercise 5.6.32). In each iteration (except the last) every entity is paired with another entity, and each pair will perform a simple merge of their local sets; half of the entities will perform this operation twice during an iteration. In the last iteration all entities, except $x_1$ and $x_n$, will be paired and perform a merge.

Example. Using the sorting diagram to describe these operations, the structure of an execution of Protocol OddEven-MergeSort when $n = 8$ is shown in Figure 5.11. Notice that there are 4 iterations; observe that, in iteration 2, merge will be performed between the pairs $(x_1, x_3)$, $(x_2, x_4)$, $(x_5, x_7)$, $(x_6, x_8)$; observe further that entities $x_2, x_3, x_6, x_7$ will each be involved in one more merge in this same iteration.

Summarizing, in each of the first $\log n$ iterations, each entity sends its data to one or two other entities. In other words, the entire distributed set is transmitted in each iteration. Hence, the total number of messages used by Protocol OddEven-MergeSort is

$$M[\text{OddEven-MergeSort}] = O(N \log n). \qquad (5.23)$$

Note that this bound holds regardless of the storage requirement.

IMPORTANT. Does the protocol work? Does it in fact sort the data? The answer to these questions is: not always. In fact, its correctness depends on several factors, including the storage requirements. It is not difficult to prove that the protocol correctly sorts, regardless of the storage requirement, if the initial set is equidistributed (Exercise 5.6.33).


Entity   initial   after merges        after merges        after merge
                   (x1,x2), (x3,x4)    (x1,x3), (x2,x4)    (x2,x3)
x1       {4, 8}    {4, 6}              {1, 4}              {1, 4}
x2       {6}       {8}                 {3}                 {3}
x3       {7}       {1}                 {6}                 {6}
x4       {1, 3}    {3, 7}              {7, 8}              {7, 8}

FIGURE 5.12: OddEven-MergeSort does not correctly perform an invariant sort for this distribution.
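The trace of Figure 5.12 can be replayed with a few lines of code. The sketch below is a centralized simulation with hypothetical names; it hard-codes the merges that OddEven-MergeSort performs for $n = 4$ and assumes that a simple merge preserves the local sizes (invariant sorting).

```python
def merge(dist, a, b):
    """Simple merge: a keeps its |D_a| smallest pooled items, b keeps the rest."""
    pooled = sorted(dist[a] | dist[b])
    k = len(dist[a])
    dist[a], dist[b] = set(pooled[:k]), set(pooled[k:])


# Merges performed for n = 4, grouped in the order they are executed.
SCHEDULE = [[(1, 2), (3, 4)], [(1, 3), (2, 4)], [(2, 3)]]


def run(initial):
    """Return the final distribution and the list of intermediate snapshots."""
    dist = {x: set(items) for x, items in initial.items()}
    trace = [{x: sorted(s) for x, s in dist.items()}]
    for step in SCHEDULE:
        for a, b in step:
            merge(dist, a, b)
        trace.append({x: sorted(s) for x, s in dist.items()})
    return dist, trace


final, trace = run({1: {4, 8}, 2: {6}, 3: {7}, 4: {1, 3}})
for snapshot in trace:
    print(snapshot)

# One extra simple merge between x1 and x2 repairs this particular instance.
repaired = {x: set(s) for x, s in final.items()}
merge(repaired, 1, 2)
print(repaired)
```

The final snapshot leaves item 3 at $x_2$ and item 4 at $x_1$, so the distribution is not sorted; the extra merge between $x_1$ and $x_2$ fixes it for this input only.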

Property 5.3.4 OddEven-MergeSort sorts any equidistributed set if the required sorting is (a) invariant-sized, (b) equidistributed, or (c) compacted.

However, if the initial set is not equidistributed, the distribution obtained when the protocol terminates might not be sorted. To understand why, consider performing an invariant sorting in the system of $n = 4$ entities shown in Figure 5.12; items 1 and 3, initially at entity $x_4$, should end up in entity $x_1$, but item 3 has only reached $x_2$ when the protocol terminates. The reason for this happening is the “bottleneck” created by the fact that only one item at a time can be moved to each of $x_2$ and $x_3$. Recall that the existence of bottlenecks was the reason for the high number of iterations of Protocol OddEven-LineSort. In this case, the problem makes the protocol incorrect.

It is indeed possible to modify the protocol, adding enough appropriate iterations, so that the distribution will be correctly sorted. The type and the number of the additional iterations needed to correct the protocol depend on many factors. In the example shown in Figure 5.12, a single iteration consisting of a simple merge between $x_1$ and $x_2$ would suffice. In general, the additional requirements depend on the specifics of the size of the initial sets; see, for example, Exercise 5.6.34.

5.3.4 Basic Limitations

In the previous sections we have seen different protocols, examined their behavior, and analyzed their costs. In this process we have seen that the amount of data items transmitted can be very large. For example, in OddEven-LineSort the number of messages is $O(Nn)$, the same as sending every item everywhere. Even ignoring the limitations imposed by the topology of the network, protocol OddEven-MergeSort still uses $O(N \log n)$ messages when it works correctly. Before proceeding any further, we are going to ask the following question: How many messages need to be sent anyway? We would like the answer to be independent of the protocol but to take into account both the topology of the network and the storage requirements. The purpose of this section is to provide such an answer, to use it to assess the solutions seen so far, and to understand its implications. On the basis of this, we will be able to design an efficient sorting protocol.

Lower Bound

There is a minimum necessary amount of data movement that must take place when sorting a distributed set. Let us determine exactly what costs must be incurred regardless of the algorithm we employ.


The basic observation we employ is that, once we are given a permutation $\pi$ according to which we must sort the data, there are some inescapable costs. In fact, if entity $x$ has some data that according to $\pi$ must end up in $y$, then this data must move from $x$ to $y$, regardless of the sorting algorithm we use. Let us state these concepts more precisely.

Given a network $G$, a distribution $D = \langle D_1, \ldots, D_n \rangle$ of $\mathcal{D}$ on $G$, and a permutation $\pi$, let $D' = \langle D'_1, \ldots, D'_n \rangle$ be the result of sorting $D$ according to $\pi$. Then $|D_i \cap D'_j|$ items must travel from $x_i$ to $x_j$; this means that the amount of data transmission for this transfer is at least $|D_i \cap D'_j| \, d_G(x_i, x_j)$.

How this amount translates into number of messages depends on the size of the messages. A message can only contain a (small) constant number of data items; to obtain a uniform measure, we consider just one data item per message. Then

Theorem 5.3.1 The number of messages required to sort $D$ according to $\pi$ in $G$ is at least

$$C(D, G, \pi) = \sum_{i \ne j} |D_i \cap D'_j| \, d_G(x_i, x_j).$$
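To make the bound concrete, the following sketch evaluates $C(D, G, \pi)$ for a small instance. It is an illustration under assumptions: the function name and argument layout are hypothetical, entities are indexed from 0, all items are distinct, `sizes[j]` is the required size of the $j$-th block of the sorted order, `pi[j]` is the entity that must hold that block, and distances are supplied as a function.

```python
def necessary_cost(dist, sizes, pi, distance):
    """Sum over i != target of |D_i ∩ D'_target| * d_G(x_i, x_target)."""
    items = sorted(d for local in dist for d in local)
    cost, start = 0, 0
    for j, k in enumerate(sizes):
        block = set(items[start:start + k])      # items that must end up at pi[j]
        start += k
        target = pi[j]
        for i, local in enumerate(dist):
            if i != target:
                cost += len(block & set(local)) * distance(i, target)
    return cost


def line_distance(i, j):
    """Hop distance on an ordered line x_0 - x_1 - ... (assumed topology)."""
    return abs(i - j)


dist = [[1, 2], [3, 4], [5, 6]]
already_sorted = necessary_cost(dist, [2, 2, 2], [0, 1, 2], line_distance)
reversed_order = necessary_cost(dist, [2, 2, 2], [2, 1, 0], line_distance)
print(already_sorted, reversed_order)
```

For the data above, sorting according to the identity costs nothing, while the reversed permutation forces the two outer blocks to cross the whole line.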

This expresses a lower bound on the number of messages for distributed sorting; the actual value depends on the topology $G$ and the storage requirements. The determination of this value in specific topologies for different storage requirements is the subject of Exercises 5.6.35–5.6.38.

Assessing Previous Solutions

Let us see what this bound means for situations we have already examined. In this bound, the topology of the network plays a role through the distances $d_G(x_i, x_j)$ between the entities that must transfer data, while the storage requirements play a role through the sizes $|D'_i|$ of the resulting sets. First of all, note that, by definition, for all $x_i, x_j$, we have $d_G(x_i, x_j) \le d(G)$; furthermore,

$$\sum_{i \ne j} |D_i \cap D'_j| \le N. \qquad (5.24)$$

To derive lower bounds on the number of messages for a speciﬁc network G, we need to consider for that network the worst possible allocation of the data, that is, the one that maximizes C(D, G, π ). Ordered Line. OddEven-LineSort Let us focus ﬁrst on the ordered line network.


If the data is not initially equidistributed, it is easy to show scenarios where $O(N)$ data must travel an $O(n)$ distance along the line. For example, consider the case when $x_n$ initially contains the smallest $N - n + 1$ items while all other entities have just a single item each; for simplicity, assume $(N - n + 1)/n$ to be an integer. Then for equidistributed sorting we have $|D_n \cap D'_j| = (N - n + 1)/n$ for $j < n$; this means that at least [...] for example, when $N \ge n^2 \log n$. In contrast, protocol OddEven-MergeSort always has a worst-case cost of $O(N \log n)$, and it might not even sort. The determination of the cost of protocol SelectSort in specific topologies for different storage requirements is the subject of Exercises 5.6.41–5.6.48.

5.3.6 Unrestricted Sorting

In the previous section we have examined the problem of sorting a distributed set according to a given permutation. This describes the common occurrence when there is some a priori ordering of the entities (e.g., of their ids), according to which the data must be sorted. There are, however, occurrences where the interest is to sort the data with no a priori restriction on what ordering of the sites should be used. In other words, in these cases, the goal is to sort the data according to some permutation. This version of the problem is called unrestricted sorting.

Solving the unrestricted sorting problem means that we, as designers, have the choice of the permutation according to which we will sort the data. Let us examine the impact of this choice in some detail. We have seen that, for a given permutation $\pi$, once the storage requirement is fixed, there is an amount of message exchanges that must necessarily be performed to transfer the records to their destinations; this amount is expressed by Theorem 5.3.1. Observe that this necessary cost is smaller for some permutations than for others.

For example, assume that the data is initially equidistributed and sorted according to $\pi_1 = \langle 1, 2, \ldots, n \rangle$, where $n$ is even. Obviously, there is no cost for an equidistributed sorting of the set according to $\pi_1$, as the data is already in the proper place. By contrast, if we need to sort the distribution according to $\pi_2 = \langle n, n - 1, \ldots, 2, 1 \rangle$, then, even with the same storage requirement as before, the operation will be very costly: at least $N$ messages must be sent, as every data item must necessarily move.


Thus, it is reasonable to ask that the entities choose the permutation $\pi$ that minimizes the necessary cost for the given storage requirement. For this task, we express the storage requirements as a tuple $k = \langle k_1, k_2, \ldots, k_n \rangle$ where $k_j \le w$ and $\sum_{1 \le j \le n} k_j = N$: the sets of the sorted distribution $D'$ must be such that $|D'_{\pi(j)}| = k_j$. Notice that this generalized storage requirement includes both the compacted (i.e., $k_j = w$) and equidistributed (i.e., $k_j = N/n$) ones, but not necessarily the identical requirement.

More precisely, the task we are facing, called dynamic sorting, is the following: given the distribution $D$ and a requirement tuple $k = \langle k_1, k_2, \ldots, k_n \rangle$, we need to determine the permutation $\pi$ such that, $\forall \pi'$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} |D_i \cap D'_j(\pi)| \, d_G(x_i, x_j) \;\le\; \sum_{i=1}^{n} \sum_{j=1}^{n} |D_i \cap D'_j(\pi')| \, d_G(x_i, x_j) \qquad (5.27)$$

where $D'(\pi) = \langle D'_1(\pi), D'_2(\pi), \ldots, D'_n(\pi) \rangle$ is the resulting distribution sorted according to $\pi$. To determine $\pi$ we must solve an optimization problem. Many optimization problems, although solvable, are computationally expensive as they are NP-hard. Surprisingly, and fortunately, our problem is not. Notice that there might be more than one permutation achieving such a goal; in this case, we just choose one (e.g., the alphanumerically smallest).

To determine $\pi$ we need to minimize the necessary cost over all possible permutations $\pi'$. Fortunately, we can do it without having to determine each $D'(\pi')$. In fact, regardless of which permutation we eventually determine to be $\pi$, because of the storage requirements we know that $k_j = |D'_{\pi(j)}|$ data items must end up in $x_{\pi(j)}$, $1 \le j \le n$. Hence, we can determine which of the items of $x_i$ must be sent to $x_{\pi(j)}$ even without knowing $\pi$. In fact, let $b_j = \mathcal{D}[\sum_{l \le j} k_l]$ be the $(k_1 + \ldots + k_j)$th smallest item overall; then all the items $d$ with $b_{j-1} < d \le b_j$ must be sent to $x_{\pi(j)}$. In other words,

$$D_{i,\pi(j)} = D_i \cap D'_{\pi(j)} = \{d \in D_i : b_{j-1} < d \le b_j\}.$$

This means that we can use the same technique as before: the entities collectively determine the items $b_1, b_2, \ldots, b_n$ employing a distributed selection protocol; then each entity $x_i$ uses these values to determine which of its own data items must be sent to $x_{\pi(j)}$. To be able to complete the task, we do need to know which entity is $x_{\pi(j)}$, that is, we need to determine $\pi$. To this end, observe that we can rewrite expression 5.27 as: $\forall \pi'$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} |D_{i,\pi(j)}| \, d_G(x_i, x_{\pi(j)}) \;\le\; \sum_{i=1}^{n} \sum_{j=1}^{n} |D_{i,\pi'(j)}| \, d_G(x_i, x_{\pi'(j)}). \qquad (5.28)$$


Strategy DynamicSelectSort
begin
   for $j = 1, \ldots, n - 1$ do
      Collectively determine $b_j = \mathcal{D}[k_1 + \ldots + k_j]$ using distributed selection;
      $D_{i,j} := \{d \in D_i : b_{j-1} < d \le b_j\}$;
      $n_i(j) := |D_{i,j}|$;
   endfor
   $D_{i,n} := \{d \in D_i : b_{n-1} < d\}$;
   $n_i(n) := |D_{i,n}|$;
   if $x_i \ne \bar{x}$ then
      send $n_i(1), \ldots, n_i(n)$ to $\bar{x}$;
   else
      wait until receive information from all entities;
      determine $\pi$ and notify all entities;
   endif
   send $D_{i,j}$ to $x_{\pi(j)}$, $1 \le j \le n$;
end

FIGURE 5.14: Strategy DynamicSelectSort.
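The step "determine $\pi$" can be illustrated with a small centralized sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the boundaries $b_j$ are computed by sorting all the data in one place (a stand-in for distributed selection), items are distinct, every $k_j \ge 1$, and the permutation is found by brute force over all $n!$ candidates, which is only sensible for tiny $n$. Choosing $\pi$ so as to minimize the sum in (5.28) has the form of an assignment problem, for which polynomial-time methods exist; brute force is used here only to keep the sketch short.

```python
from itertools import permutations


def block_counts(dist, sizes):
    """counts[i][j] = n_i(j): items of entity i lying in (b_{j-1}, b_j]."""
    items = sorted(d for local in dist for d in local)
    bounds, total = [], 0
    for k in sizes:
        total += k
        bounds.append(items[total - 1])          # b_j: (k_1+...+k_j)-th smallest
    counts = []
    for local in dist:
        row = [0] * len(sizes)
        for d in local:
            j = next(j for j, b in enumerate(bounds) if d <= b)
            row[j] += 1
        counts.append(row)
    return counts


def best_permutation(counts, distance):
    """Permutation pi (block j goes to entity pi[j]) minimizing the necessary cost."""
    n = len(counts)

    def cost(pi):
        return sum(counts[i][j] * distance(i, pi[j])
                   for i in range(n) for j in range(n))

    # Ties are broken by taking the smallest tuple.
    best = min(permutations(range(n)), key=lambda pi: (cost(pi), pi))
    return best, cost(best)


dist = [[5, 6], [3, 4], [1, 2]]                  # sorted in reverse entity order
counts = block_counts(dist, [2, 2, 2])
pi, cost = best_permutation(counts, lambda i, j: abs(i - j))
print(counts, pi, cost)
```

Here the data is already sorted according to the reversed order of the entities, so the chosen permutation sends each block to the entity that already holds it and the necessary cost is zero.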

Using this fact, $\pi$ can be determined in low polynomial time once we know the sizes $|D_{i,\pi(j)}|$ as well as the distances $d_G(x, y)$ between all pairs of entities (Exercise 5.6.49). Therefore, our overall solution strategy is the following: first each entity $x_i$ determines the local sets $D_{i,j}$ using distributed selection; then, using information about the sizes $|D_{i,j}|$ of those sets and the distances $d_G(x, y)$ between entities, a single entity $\bar{x}$ determines the permutation $\pi$ that minimizes Expression 5.28; finally, once $\pi$ is made known, each entity sends the data to their final destination. A high level description is shown in Figure 5.14. Missing from this description is the collection at the coordinator $\bar{x}$ of the distance information; this can be achieved simply by having each entity $x$ send to $\bar{x}$ the distances from its neighbors $N(x)$. Once all details have been specified, the resulting Protocol DynamicSelectSorting will make it possible to sort a distribution according to the permutation, unknown a priori, that minimizes the necessary costs. See Exercise 5.6.50.

The additional costs of the protocol are not difficult to determine. In fact, Protocol DynamicSelectSorting is exactly the same as Protocol SelectSort with two additional operations: (1) the collection at $\bar{x}$ of the distance and size information, and (2) the notification by $\bar{x}$ of the permutation $\pi$. The first operation requires $|N(x)| + n$ items of information to be sent by each entity $x$ to $\bar{x}$: the $|N(x)|$ distances from its neighbors and the $n$ sizes $|D_{i,\pi(j)}|$. The second operation consists of sending $\pi$, which is composed of $n$ items of information. Hence, the cost incurred by Protocol DynamicSelectSorting in addition to that of Protocol SelectSort is

$$\sum_{x} (|N(x)| + 2n) \, d_G(x, \bar{x}). \qquad (5.29)$$
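As a quick numerical check of Expression 5.29, the sketch below evaluates the sum on a small network. The names are hypothetical, the network is given as an adjacency dictionary, and $d_G$ is assumed to be the hop distance computed by breadth-first search (the text does not fix the link costs here).

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: number of hops from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def additional_cost(adj, coordinator):
    """Evaluate sum over x of (|N(x)| + 2n) * d_G(x, coordinator)."""
    n = len(adj)
    dist = hop_distances(adj, coordinator)
    return sum((len(adj[x]) + 2 * n) * dist[x] for x in adj)


# A line of four entities; the coordinator is placed at an end or next to it.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cost_end = additional_cost(line, 0)
cost_inner = additional_cost(line, 1)
print(cost_end, cost_inner)
```

The two values differ only because of where the coordinator sits, and neither depends on the number $N$ of data items.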


Notice that this cost does not depend on the size $N$ of the distributed set, and it is less than the total additional costs of Protocol SelectSort. This means that, with twice the additional cost of Protocol SelectSort, we can sort minimizing the necessary costs. So, for example, if the data was already sorted according to some unknown permutation, Protocol DynamicSelectSorting will recognize it, determine the permutation, and no data items will be moved at all.

5.4 DISTRIBUTED SETS OPERATIONS

5.4.1 Operations on Distributed Sets

A key element in the functionality of distributed data is the ability to answer queries about the data as well as about the individual sets stored at the entities. Because the data is stored in many places, it is desirable to answer the query in such a way as to minimize the communication. We have already discussed answering simple queries such as order statistics. In systems dealing mainly with distributed data, such as distributed database systems, distributed file systems, distributed objects systems, and so forth, the queries are much more complex, and are typically expressed in terms of primitive operations. In particular, in relational databases, a query will be an expression of join, project, and select operations. These operations are actually operations on sets and can be re-expressed in terms of the traditional operators intersection, union, and difference between sets. So to answer a query of the form “Find all the computer science students as well as those social science students enrolled also in anthropology but not in sociology”, we will need to compute an expression of the form

$$A \cup ((B \cap C) - (B \cap D)) \qquad (5.30)$$

where A, B, C, and D are the sets of the students in computer science, social sciences, anthropology, and sociology, respectively. Clearly, if these sets are located at the entity x where the query originates, that entity can locally compute the results and generate the answer. However, if the entity x does not have all the necessary data, x will have to involve other entities causing communication. It is possible that each set is actually stored at a different entity, called the owner of that set, and none of them is at x. Even assuming that x knows which entities are the owners of the sets involved, there are many different ways and approaches that can be used to perform the computation. For example, all those sets could be sent by the owners to x, which will then perform the operation locally and answer the query. With this approach, call it A1, the volume of data items that will be moved is Vol(A1) = |A| + |B| + |C| + |D| . The actual number of messages will depend on the size of these sets as well as on the distances between x(A), x(B), x(C), x(D), and x, where x(·) denotes the owner


of the specified set. In some cases, for example in complete networks, the number of messages is given precisely by these sizes.

Another approach is to have $x(B)$ send $B$ to $x(C)$; $x(C)$ will then locally compute $B \cap C$ and send it to $x(D)$, which will locally compute $(B \cap C) - (B \cap D) = (B \cap C) - D$ and send it to $x(A)$, which will compute the final answer and send it to $x$. The amount of data moved with this approach, call it A2, is

$$\mathrm{Vol}(A2) = |B| + |B \cap C| + |(B \cap C) - D| + |A \cup ((B \cap C) - D)|.$$

Depending on the sizes of the sets resulting from the partial computations, A1 could be better than A2.

Other approaches can be devised, each with its own cost. For example, as $(B \cap C) - D = B \cap (C - D)$, we could have $x(C)$ send $C$ to $x(D)$, which will use it to compute $C - D$ and send the result to $x(B)$; if we also have $x(A)$ send $A$ to $x(B)$, $x(B)$ can compute Expression 5.30 and send the result to $x$. The volume of transmitted items with this approach, call it A3, will be

$$\mathrm{Vol}(A3) = |C| + |C - D| + |A| + |A \cup ((B \cap C) - D)|.$$

IMPORTANT. In each approach, or strategy, the original expression is broken down into subexpressions, each to be evaluated just at a single site. For example, in approach A2 expression 5.30 is decomposed into three subexpressions: $E_1 = (B \cap C)$ to be computed by $x(C)$, $E_2 = E_1 - D$ to be computed by $x(D)$, and $E_3 = A \cup E_2$ to be computed by $x(A)$. A strategy also specifies, for each entity involved in the computation, to what other sites it must send its own set or the results of local evaluations. For example, in approach A2, $x(B)$ must send $B$ to $x(C)$; $x(C)$ must send $E_1$ to $x(D)$; $x(D)$ must send $E_2$ to $x(A)$; and $x(A)$ must send $E_3$ to the originator of the query $x$.

As already mentioned, the amount of items transferred by a strategy depends on the size of the results of the subexpressions (e.g., $|B \cap C|$). Typically these sizes are not known a priori; hence, it is in general impossible to know beforehand which of these approaches is better from a communication point of view. In practice, estimates of those sizes are used to decide the best strategy to use. Indeed, a large body of studies exists on how to estimate the size of an intersection or a union or a difference of two or more sets. In particular, an entire research area, called distributed query processing, is devoted to the study of the problem of computing the “best” strategy, and related problems.

We can, however, express a lower bound on the amount of data that must be moved. As the entity $x$ where the query originates must provide the answer, then, assuming $x$ has none of the sets involved in the query, it must receive the entire answer. That is

Theorem 5.4.1 For every expression $E$, if the set of the entity $x$ where the query originates is not involved in the expression, then for any strategy $S$

$$\mathrm{Vol}(S) \ge |E|.$$
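The three volumes can be compared on a toy instance. The student names and set contents below are invented purely for illustration; the formulas are the ones given for approaches A1, A2, and A3.

```python
# Hypothetical data: A = computer science, B = social sciences,
# C = anthropology, D = sociology.
A = {"ann", "bob"}
B = {"cat", "dan", "eve", "fay"}
C = {"cat", "dan", "eve", "gus"}
D = {"dan", "hal"}

answer = A | ((B & C) - (B & D))          # Expression 5.30

# A1: every owner ships its whole set to the originator x.
vol_a1 = len(A) + len(B) + len(C) + len(D)

# A2: B -> x(C), B∩C -> x(D), (B∩C)-D -> x(A), final answer -> x.
vol_a2 = len(B) + len(B & C) + len((B & C) - D) + len(answer)

# A3: C -> x(D), C-D -> x(B), A -> x(B), final answer -> x.
vol_a3 = len(C) + len(C - D) + len(A) + len(answer)

print(sorted(answer), vol_a1, vol_a2, vol_a3)
```

With these particular sets A1 happens to move the fewest items, which shows that shipping everything to the originator is not always the worst choice; all three volumes are at least $|E|$, as Theorem 5.4.1 requires when $x$ owns none of the sets.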


What we will examine in the rest of this section is how we can answer queries efficiently by cleverly organizing the local sets. In fact, we will see how the sets can be locally structured so that the computations of those subexpressions (and, thus, the answer to those queries) can be performed minimizing the volume of data to be moved. To perform the structuring, some information is needed at each entity; if not available, it can be computed in a prestructuring phase.

5.4.2 Local Structure

We first of all see how we can structure at each entity $x_i$ the local data $D_i$ so as to answer operations of intersection and difference with the minimum amount of communication. The method we use to structure a local set is called Intersection Difference Partitioning (IDP). The idea of this method is to store each set $D_i$ as a collection $Z^i$ of disjoint subsets such that operations of union, intersection, and difference among the data sets can be computed easily, and with the least amount of data transfers.

Let us see precisely how we construct the partition $Z^i$ of the data set $D_i$. For simplicity, let us momentarily rename the other $n - 1$ sets $D_j$ ($j \ne i$) as $S_1, S_2, \ldots, S_{n-1}$. Let us start with the entire set

$$Z^i_{0,1} = D_i. \qquad (5.31)$$

We first of all partition it into two subsets: $Z^i_{1,1} = D_i \cap S_1$ and $Z^i_{1,2} = D_i - S_1$. Then recursively, we partition $Z^i_{l,j}$ into two subsets:

$$Z^i_{l+1,2j-1} = Z^i_{l,j} \cap S_{l+1} \qquad (5.32)$$

$$Z^i_{l+1,2j} = Z^i_{l,j} - S_{l+1}. \qquad (5.33)$$

We continue this process until we obtain the sets $Z^i_{n-1,j}$; these sets form exactly the partition of $D_i$ we need. For simplicity, we will denote $Z^i_{n-1,j}$ simply as $Z^i_j$; hence the final partition of $D_i$ will be denoted by

$$Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle \qquad (5.34)$$

where $m = 2^{n-1}$.

Example. Consider the three sets $D_1 = \{a, b, e, f, g, m, n, q\}$, $D_2 = \{a, e, f, g, o, p, r, u, v\}$, and $D_3 = \{e, f, p, r, m, q, v\}$ stored at entities $x_1, x_2, x_3$, respectively. Let us focus on $D_1$; it is first subdivided into $Z^1_{1,1} = D_1 \cap D_2 = \{a, e, f, g\}$ and $Z^1_{1,2} = D_1 - D_2 = \{b, m, n, q\}$. These are then subdivided creating the final partition $Z^1$ composed of $Z^1_{2,1} = \{e, f\}$, $Z^1_{2,2} = \{a, g\}$, $Z^1_{2,3} = \{m, q\}$, and $Z^1_{2,4} = \{b, n\}$.
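The recursive construction (5.31)–(5.33) translates almost directly into code. The sketch below uses a hypothetical function name and represents each set as a Python set of one-letter items; it returns the leaves $Z^i_1, \ldots, Z^i_m$ in order.

```python
def idp_partition(d_i, others):
    """Leaves of the IDP tree of d_i, splitting by S_1, S_2, ... in turn."""
    level = [set(d_i)]                    # Z_{0,1} = D_i
    for s in others:
        next_level = []
        for z in level:
            next_level.append(z & s)      # Z_{l+1,2j-1} = Z_{l,j} ∩ S_{l+1}
            next_level.append(z - s)      # Z_{l+1,2j}   = Z_{l,j} - S_{l+1}
        level = next_level
    return level


D1 = set("abefgmnq")
D2 = set("aefgopruv")
D3 = set("efprmqv")

Z1 = idp_partition(D1, [D2, D3])          # S_1 = D2, S_2 = D3
Z3 = idp_partition(D3, [D1, D2])          # S_1 = D1, S_2 = D2
print([sorted(z) for z in Z1])
print([sorted(z) for z in Z3])
```

The leaves are pairwise disjoint and their union is the original set, so no extra data items are stored; note that a leaf may be empty, as happens for the last leaf of $D_3$.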


T1:  D1 = {a, b, e, f, g, m, n, q}
     level 1:  {a, e, f, g}    {b, m, n, q}
     level 2:  {e, f}  {a, g}  {m, q}  {b, n}

T2:  D2 = {a, e, f, g, o, p, r, u, v}
     level 1:  {a, e, f, g}    {o, p, r, u, v}
     level 2:  {e, f}  {a, g}  {p, r, v}  {o, u}

T3:  D3 = {e, f, m, p, q, r, v}
     level 1:  {e, f, m, q}    {p, r, v}
     level 2:  {e, f}  {m, q}  {p, r, v}  {}

FIGURE 5.15: Trees created by IDP.

This recursive partitioning of the set $D_i$ creates a binary tree $T_i$. The root (considered to be at level 0) corresponds to the entire set $D_i$. Each node in the tree corresponds to a subset (one of the $Z^i_{l,j}$) of this set; note that this subset is possibly empty. For a node at level $l - 1$ corresponding to subset $S$, its left child corresponds to the subset $S \cap S_l$ while the right child corresponds to the subset $S - S_l$. The trees for the three sets of the example above are shown in Figure 5.15.

Notice that at each level of the tree (including the last level $l = n - 1$), the entire set is represented:

Property 5.4.1 $D_i = \bigcup_{1 \le j \le 2^l} Z^i_{l,j}$

In other words, $Z^i_{l,1}, Z^i_{l,2}, \ldots, Z^i_{l,2^l}$ is a partition of $D_i$. Further observe that each level $l \ge 1$ of the tree describes the relationship between elements of $D_i$ and those in the set $S_l$. In particular, the sets corresponding to the left children of level $l$ are precisely the elements in common between $D_i$ and $S_l$:

Property 5.4.2 $\bigcup_{1 \le j \le 2^{l-1}} Z^i_{l,2j-1} = D_i \cap S_l$

By contrast, the sets corresponding to the right children of level $l$ are precisely the elements in $D_i$ that are not part of $S_l$:

Property 5.4.3 $\bigcup_{1 \le j \le 2^{l-1}} Z^i_{l,2j} = D_i - S_l$

This means that, if we were to store at $x_i$ the entire tree $T_i$ (i.e., all the sets $Z^i_{l,j}$), then $x_i$ can immediately answer any query of the form $D_i - D_j$ and $D_i \cap D_j$ for


any $j$. In other words, if each $x_i$ has available its tree $T_i$ then any query of the form $D_i - D_j$ and $D_i \cap D_j$ can be answered by $x_i$ without any communication. We are going to see now that it is possible to achieve the same goal storing at $x_i$ only the last partition $Z^i$ (i.e., the leaves of the tree).

Observe that each level $l$ of the tree contains not only the entire set $D_i$ but also information about the relationship between $D_i$ and all the sets $S_1, S_2, \ldots, S_l$. In particular, the last level $l = n - 1$ (i.e., the final partition) contains information about the relationship between $D_i$ and all the other sets. More precisely, the information contained in each node of the tree $T_i$ is also contained in the final partition and can be reconstructed from there:

Property 5.4.4 $Z^i_{l,j} = \bigcup_{1 \le k \le 2^{n-1-l}} Z^i_{k + (j-1)\, 2^{n-1-l}}$

Summarizing, each entity $x_i$ structures its local set $D_i$ as the collection $Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle$ of disjoint subsets created using the IDP method. This collection contains all the information contained in each node of the tree $T_i$.

IMPORTANT. Notice that when structuring $D_i$ as the partition $Z^i$, the number of data items stored at $x_i$ is still $|D_i|$, that is, no additional data items are stored anywhere.

5.4.3 Local Evaluation

Locally Computable Expressions

If each $x_i$ stores its set $D_i$ as the partition $Z^i$, then each entity is immediately capable of computing the result of many expressions involving set operations. For example, we know that the partition $Z^i$ contains all the information contained in each node of the tree $T_i$ (Property 5.4.4); thus, by Properties 5.4.2 and 5.4.3, it follows that $x_i$ can answer without any communication any query of the form $D_i - D_j$ and $D_i \cap D_j$. In fact,

$$D_i \cap S_l = \bigcup_{1 \le j \le 2^{l-1},\; 1 \le k \le 2^{n-1-l}} Z^i_{k + (j-1)\, 2^{n-l}} \qquad (5.35)$$

$$D_i - S_l = \bigcup_{1 \le j \le 2^{l-1},\; 1 \le k \le 2^{n-1-l}} Z^i_{k + (2j-1)\, 2^{n-l-1}}. \qquad (5.36)$$
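The index arithmetic in (5.35) and (5.36) can be checked mechanically on the running example. The sketch below is self-contained and uses hypothetical function names; the leaves are stored in a Python list, so leaf $Z^i_q$ is `leaves[q - 1]`.

```python
def idp_partition(d_i, others):
    """Leaves Z_1, ..., Z_m of the IDP tree of d_i (m = 2 ** len(others))."""
    level = [set(d_i)]
    for s in others:
        level = [part for z in level for part in (z & s, z - s)]
    return level


def intersection_from_leaves(leaves, l):
    """D_i ∩ S_l rebuilt from the leaves only, following (5.35)."""
    n = len(leaves).bit_length()          # len(leaves) = 2 ** (n - 1)
    out = set()
    for j in range(1, 2 ** (l - 1) + 1):
        for k in range(1, 2 ** (n - 1 - l) + 1):
            out |= leaves[k + (j - 1) * 2 ** (n - l) - 1]
    return out


def difference_from_leaves(leaves, l):
    """D_i - S_l rebuilt from the leaves only, following (5.36)."""
    n = len(leaves).bit_length()
    out = set()
    for j in range(1, 2 ** (l - 1) + 1):
        for k in range(1, 2 ** (n - 1 - l) + 1):
            out |= leaves[k + (2 * j - 1) * 2 ** (n - l - 1) - 1]
    return out


D1, D2, D3 = set("abefgmnq"), set("aefgopruv"), set("efprmqv")
Z1 = idp_partition(D1, [D2, D3])          # S_1 = D2, S_2 = D3
print(sorted(intersection_from_leaves(Z1, 2)), sorted(difference_from_leaves(Z1, 1)))
```

For each $l$, the rebuilt sets coincide with $D_1 \cap S_l$ and $D_1 - S_l$ computed directly, without looking at $S_l$ itself.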

Actually, xi has locally available the answer to any expression composed of differences and intersections, involving any number of sets, provided that Di is the left operand in the differences involving Di . So for example, the query (D1 − D2 ) ∩ (D3 − (D4 ∩ D5 )) can be answered immediately both at x1 and x3 (see Exercise 5.6.51). Also some queries involving unions as well as intersections and differences can be answered immediately and locally. For example, both (D1 − (D2 ∩ D3 )) and ((D1 − D2 ) ∩ (D1 ∪ D3 )) can be answered by x1 .


Exactly what expressions can be answered by $x_i$? To answer this question, observe the following:
1. if expression $E$ can be answered locally by $x_i$, then $x_i$ can also answer $E \cap E'$ and $E - E'$, where $E'$ is an arbitrary expression on the local sets;
2. if two expressions $E_1$ and $E_2$ can be answered locally by $x_i$, so can the expression $E_1 \cup E_2$.

Using these two facts and starting with $D_i$, we can characterize the set $E(x_i)$ of all the expressions that can be answered by $x_i$ directly without communication.

Local Evaluation Strategy

Let us see now how $x_i$ can determine the answer to a query in $E(x_i)$ from the information stored in the final partition $Z^i = \langle Z^i_1, Z^i_2, \ldots, Z^i_m \rangle$, where $m = 2^{n-1}$.

First of all, let us introduce some terminology. We will call address of $Z^i_j$ the Boolean representation $b(j)$ of $j - 1$ using $n - 1$ bits; for example, in Figure 5.15, the subset $Z^1_{2,3} = \{m, q\}$ has address 10, while 11 is the address of the subset $Z^1_{2,4}$. An expression on $k$ operands is sequential if it is of the form $((\ldots(((O_1\ o_1\ O_2)\ o_2\ O_3)\ o_3\ O_4) \ldots)\ o_{k-1}\ O_k)$, where the $O_j$ are the operands and the $o_j$ are the set operators. An example of a sequential expression is $(((A \cup B) - C) \cup B)$.

First consider the set $E^-(x_i) \subset E(x_i)$ of sequential expressions in $E(x_i)$ where
1. $D_i$ is the first operand,
2. each of the other sets $S_j$ appears at most once, and
3. the only operators are intersection and difference.

For example, the expression $(((D_i \cap S_3) - S_1) \cap S_2)$ is in $E^-(x_i)$. To answer queries in $E^-(x_i)$ there is a simple strategy that $x_i$ can follow:

Strategy Bitmask
1. Create a bitmask of size $n - 1$. Positions are counted from the left, as in the addresses.
2. For each set $S_j$:
   (a) if $S_j$ is the right operand of an intersection operator, then place a 0 in the $j$th position of the bitmask;
   (b) if $S_j$ is the right operand of a difference operator, then place a 1 in the $j$th position of the bitmask;
   (c) if $S_j$ is not involved in the query at all, place the wildcard symbol in the $j$th position of the bitmask.


3. Perform the union of all the subsets in the final partition whose address matches the pattern of the bitmask, where the wildcard symbol is matched by both 0 and 1.

Example. The bitmask associated with expression

$$(((D_i \cap S_3) - S_1) \cap S_4) \qquad (5.37)$$

when $n = 6$ will be 1 * 0 0 *, where * denotes the wildcard symbol. Entity $x_i$ will then calculate the union of the sets in its final partition $Z^i$ whose addresses match the bitmask; that is, the sets with address 10000, 10001, 11000, 11001. Thus, to answer query (5.37), $x_i$ will just calculate

$$Z^i_{17} \cup Z^i_{18} \cup Z^i_{25} \cup Z^i_{26}. \qquad (5.38)$$

It is not difficult to verify that indeed by calculating (5.38) we obtain the answer to precisely query (5.37); in fact, the Evaluation Strategy Bitmask is correct (Exercise 5.6.53).

Summarizing, using strategy Bitmask entity $x_i$ can directly evaluate any expression in $E^-(x_i)$; those are, however, only a small subset of all the expressions in $E(x_i)$. Let us now examine how to extend to all queries in $E(x_i)$ the result we have just obtained. The key to the extension is the fact that any expression of $E(x_i)$ can be re-expressed as the union of subexpressions in $E^-(x_i)$ (Exercise 5.6.54).

Property 5.4.5 For every $Q \in E(x_i)$ there are $Q(1), \ldots, Q(k) \in E^-(x_i)$, $k \ge 1$, such that $Q = \bigcup_{1 \le j \le k} Q(j)$.

For example, $(D_i - (S_2 \cap S_4))$ can be re-expressed as $(D_i - S_2) \cup (D_i - S_4)$. Similarly, $((S_1 \cup S_2) \cap D_i) - (S_4 \cup S_5) = ((D_i \cap S_1) - S_4 - S_5) \cup ((D_i \cap S_2) - S_4 - S_5)$.

Thus, to answer a query in $E(x_i)$, entity $x_i$ will first re-formulate it as a union of expressions in $E^-(x_i)$, evaluate each of them using strategy Bitmask, and then perform their union.

Strategy Local Evaluation
1. Re-formulate $Q$ as a union of expressions $Q(1), \ldots, Q(k)$ in $E^-(x_i)$.
2. Evaluate each $Q(j)$ using strategy Bitmask.
3. Perform the union of all the obtained results.

Notice that all this can be done by $x_i$ locally, without any communication.
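Strategies Bitmask and Local Evaluation can be sketched on the three-set example of Section 5.4.2. The function names are hypothetical, the wildcard is written `*`, and the sketch assumes that position $j$ of the mask, counted from the left, refers to $S_j$, which is the convention under which the leaf $\{m, q\}$ of $D_1$ has address 10.

```python
def idp_partition(d_i, others):
    """Leaves Z_1, ..., Z_m of the IDP tree of d_i (m = 2 ** len(others))."""
    level = [set(d_i)]
    for s in others:
        level = [part for z in level for part in (z & s, z - s)]
    return level


def bitmask_eval(leaves, mask):
    """Union of the leaves whose address matches mask ('0', '1' or '*')."""
    width = len(mask)
    out = set()
    for index, z in enumerate(leaves):                 # index = j - 1
        address = format(index, "0{}b".format(width))
        if all(m in ("*", a) for m, a in zip(mask, address)):
            out |= z
    return out


def local_eval(leaves, masks):
    """Strategy Local Evaluation: union of several Bitmask evaluations."""
    out = set()
    for mask in masks:
        out |= bitmask_eval(leaves, mask)
    return out


D1, D2, D3 = set("abefgmnq"), set("aefgopruv"), set("efprmqv")
Z1 = idp_partition(D1, [D2, D3])                       # S_1 = D2, S_2 = D3

q1 = bitmask_eval(Z1, "10")                            # (D1 - S_1) ∩ S_2
q2 = local_eval(Z1, ["1*", "*1"])                      # (D1 - S_1) ∪ (D1 - S_2)
print(sorted(q1), sorted(q2))
```

The second query is $D_1 - (D_2 \cap D_3)$ rewritten as a union of two expressions in $E^-(x_1)$; both results are obtained from the leaves alone, without reading $D_2$ or $D_3$.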


5.4.4 Global Evaluation

Let us now examine the problem of answering a query Q originating at an entity x once every local set $D_i$ has been stored as the partition $Z^i$. If the query can be answered directly (i.e., Q ∈ E(x)), x will do so. Otherwise, the query will be decomposed into subqueries that can be locally evaluated at one or more entities; the results of these partial evaluations are then collected at x so that the original query can be answered. Our goal is to ensure that the volume of data items to be moved is minimized. To achieve this goal, we use the following property.

Property 5.4.6 For every expression Q there are k ≤ n subexpressions Q(1), Q(2), . . . , Q(k) such that
1. $\forall Q(j)\ \exists y_j : Q(j) \in E(y_j)$,
2. $Q(i) \cap Q(j) = \emptyset$ for $i \ne j$,
3. $Q = \bigcup_{1 \le j \le k} Q(j)$.

That is, any query Q can be re-expressed as the union of subqueries Q(1), . . . , Q(k), where each subquery can be answered directly by just one entity, once its local set has been stored using the partitioning method; furthermore, the answers to any two different subqueries are disjoint (Exercise 5.6.55). This gives rise to our strategy for evaluating an arbitrary query:

Strategy Global
1. x decomposes Q into Q(1), Q(2), . . . , Q(k) satisfying Property 5.4.6, and informs each $y_j$ of Q(j);
2. $y_j$ locally and directly evaluates Q(j) and sends the result to x; and
3. x computes the union of all the received items.

To understand the advantages of this strategy, let us examine again the implications of Property 5.4.6. As the results of any two subqueries are disjoint, while the union of all results of the subqueries is precisely what we are asking for, we have that:

Property 5.4.7 Let Q(1), Q(2), . . . , Q(k) satisfy Property 5.4.6 for Q. Then

$$|Q| = \sum_{1 \le j \le k} |Q(j)|.$$

This means that, for every query Q, in our Strategy Global the only data items that might be moved to x are those in the ﬁnal answer, that is, Vol[Global] ≤ |Q|.
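The volume bound can be checked on a small instance. The following is only a sketch: the sets, the query Q = D1 ∪ D2 ∪ D3, and its decomposition are hypothetical, and the code simply assumes that each subquery can be evaluated locally by some entity (the partitioning method that guarantees this is not modeled here).

```python
def strategy_global(partial_answers):
    """Merge the partial answers received at x (step 3 of Strategy Global).

    Returns the final answer and the number of data items received.
    """
    answer = set()
    volume = 0
    for items in partial_answers:
        # Property 5.4.6 (2): answers to different subqueries are disjoint.
        if answer & items:
            raise ValueError("subquery answers are not disjoint")
        answer |= items
        volume += len(items)
    return answer, volume


# Hypothetical local sets and query Q = D1 ∪ D2 ∪ D3.
D1, D2, D3 = {1, 2, 3, 4}, {3, 4, 5, 6}, {6, 7, 8}
# One disjoint decomposition of Q (assumed to be locally evaluable).
parts = [D1, D2 - D1, D3 - D1 - D2]

answer, volume = strategy_global(parts)
assert answer == D1 | D2 | D3
assert volume == len(answer)  # no item is received twice
```

Because the partial answers are disjoint, every received item belongs to the final answer and is received exactly once; items of a subquery evaluated at x itself need not be moved at all, which is why the bound is an inequality.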


In other words, strategy Global is optimal. This optimality is with regard to the amount of data items that will be moved. There are different possible decompositions of a query Q into subqueries satisfying Property 5.4.6. All of them are equally acceptable to our strategy, and they all provide optimal volume costs.

IMPORTANT. To calculate the cost in terms of messages we need to take into account also the distances between the nodes in the network. In this regard, some decompositions may be better than others. The problem of determining the decomposition that requires the fewest messages is a difficult one, and no solution is known to date.

5.4.5 Operational Costs

An important consideration is that of the cost of setting up the final partitions at each entity. Once the data are in this format, we have seen how complex queries can be handled with minimal communication. But to get the data into this format requires communication; in fact, each entity must somehow receive information from all the other entities about their sets. In a complete network this can require just a single transmission of each set to a predetermined coordinator that will then compute and send the appropriate partition to each entity; hence, the total cost will be O(N), where N is the total amount of data. By contrast, in a line network the total cost can be as bad as $O(N^2)$, for example, if all sets have almost the same size. It is true that this cost is incurred only once, at set-up time. If the goal is only to answer a few queries, the cost of setup may exceed that of simply performing the queries without using the partitioned sets. But for persistent distributed data, upon which many queries may be placed, this is an efficient solution. Another consideration is that of the addition or removal of data from the distributed sets.
As each entity contains some knowledge about the contents of all other entities, any time an item is added to or removed from one of the sets, every entity must update its partition to reflect this fact. Fortunately, the cost of doing this does not exceed the cost of broadcasting the added (or removed) item to each entity. Clearly this format is more effective for slowly changing distributed data sets.

5.5 BIBLIOGRAPHICAL NOTES

The problems of distributed selection and distributed sorting were studied for a small set by Greg Frederickson in special networks (Exercises 5.6.1–5.6.3) [4], and by Shmuel Zaks [23]. Again for a small set, the cost using bounded messages and, thus, the bit complexity has been studied by Mike Loui [8] in ring networks; by Ornan Gerstel, Yishay Mansour, and Shmuel Zaks in a star [5]; and in trees by Ornan Gerstel and Shmuel Zaks [6], and by Alberto Negro, Nicola Santoro, and Jorge Urrutia [12]. Selection among two sites was first studied by Michael Rodeh [14]; his solution was later improved by S. Mantzaris [10], and by Francis Chin and Hing Ting [3]. Reducing the expected costs of distributed selection has been the goal of several investigations. Protocol RandomSelect was designed by Liuba Shrira, Nissim Francez,


and Michael Rodeh [21]. Nicola Santoro, Jeffrey Sidney, and Stuart Sidney designed Protocol RandomFlipSelect [19]. Protocol RandomRandomSelect is due to Nicola Santoro, Michael Scheutzow, and Jeffrey Sidney [17]. General selection protocols, with emphasis on the worst case, were developed by Doron Rotem, Nicola Santoro, and Jeffrey Sidney [16], and by Nicola Santoro and Jeffrey Sidney [18]. The more efficient protocol Filter was developed by John Marberg and Eli Gafni [11]. The even more efficient protocol ReduceSelect was later designed by Nicola Santoro and Ed Suen [19].

The Odd-Even Mergesort sorting algorithm, on which Protocols OddEven-LineSort and OddEven-MergeSort are based, was developed by Kenneth Batcher [1]. The first general distributed sorting algorithm is due to Lutz Wegner [22]. More recent but equally costly sorting protocols have been designed by To-Yat Cheung [2], and by Peter Hofstee, Alain Martin, and Jan van de Snepscheut [7]; experimental evaluations were performed by Wo-Shun Luk and Franky Ling [9]. The optimal SelectSort was designed by Doron Rotem, Nicola Santoro, and Jeffrey B. Sidney [15], who also designed protocol DynamicSelectSort. Other protocols include those designed by Hanmao Shi and Jonathan Schaeffer [20].

There is an extensive body of investigations on database queries, whose computation requires the use of distributed set operations like union, intersection, and difference. The entire field of distributed query processing is dedicated to this topic, mostly focusing on the estimation of the size of the output of a set operation and thus of the entire query. The IDP structure for minimum-volume operations on distributed sets was designed and analyzed in this context by Ekow Otoo, Nicola Santoro, and Doron Rotem [13].

5.6 EXERCISES, PROBLEMS, AND ANSWERS

5.6.1 Exercises

Exercise 5.6.1 () Consider a ring network where each entity has just one item. Show how to perform selection using $O(n \log^3 n)$ messages.

Exercise 5.6.2 () Consider a mesh network where each entity has just one item. Show how to perform selection using $O(n \log^{3/2} n)$ messages.

Exercise 5.6.3 () Consider a network whose topology is a complete binary tree where each entity has just one item. Show how to perform selection using O(n log n) messages.

Exercise 5.6.4 Prove that after discarding the elements greater than $m_x$ from $D_x$ and discarding the elements greater than $m_y$ from $D_y$, the overall lower median is the lower median of the elements still under consideration.


Exercise 5.6.5 Write protocol Halving so that it works with any two arbitrarily sized sets with the same complexity.

Exercise 5.6.6 Prove that the K-selection problem can be reduced to a median-finding problem regardless of K and of the size of the two sets.

Exercise 5.6.7 Modify protocol Halving as follows: In iteration i, (a) discard from both $D_x^i$ and $D_y^i$ all elements greater than $\max\{m_x^i, m_y^i\}$ and all those smaller than $\min\{m_x^i, m_y^i\}$, where $D_x^i$ and $D_y^i$ denote the sets of elements of $D_x$ and $D_y$ still under consideration at the beginning of stage i, and $m_x^i$ and $m_y^i$ denote their lower medians; (b) transform the problem again into a median-finding one. Write the corresponding algorithm, GeneralHalving, prove its correctness, and analyze its complexity.

Exercise 5.6.8 Implement protocol GeneralHalving of Exercise 5.6.7, thoroughly test it, and run extensive experiments. Compare the experimental results with the theoretical ones.

Exercise 5.6.9 () Extend the technique of protocol Halving to work with three sets, $D_x$, $D_y$, and $D_z$. Write the corresponding protocol, prove its correctness, and analyze its complexity.

Exercise 5.6.10 Random Item Selection () Modify the protocol of Exercise 2.9.52 so that it can be used to select uniformly at random an element still under consideration in each iteration of Strategy RankSelect. Your protocol should use at most $2(n − 1) + d_T(s, x)$ messages and $2r(s) + d_T(s, x)$ ideal time units in each iteration. Prove both correctness and complexity.

Exercise 5.6.11 () Prove that the expected number of iterations performed by Protocol RandomSelect until termination is at most 1.387 log N + O(1).

Exercise 5.6.12 () Determine the number of iterations if we terminate protocol RandomSelect as soon as the search space contains at most cn items, where c is a fixed constant. Determine the total cost of this truncated execution followed by an execution of protocol Rank.
Exercise 5.6.13 Prove that in the worst case, the number of iterations performed by Protocol RandomFlipSelect until termination is N.

Exercise 5.6.14 () Prove that the expected number of iterations performed by Protocol RandomFlip until termination is less than ln(Δ) + ln(n) + O(1).


Exercise 5.6.15 () Determine the number of iterations if we terminate protocol RandomFlipSelect as soon as the search space contains at most cn items, where c is a fixed constant. Determine the total cost of this truncated execution followed by an execution of protocol Rank.

Exercise 5.6.16 Write Protocol RandomRandomSelect ensuring that each iteration uses at most 4(n − 1) + r(s) messages and 5r(s) ideal time units. Implement the protocol and thoroughly test your implementation.

Exercise 5.6.17 () Prove that the expected number of iterations performed by Protocol RandomRandomSelect until there are fewer than n items left under consideration is at most $\frac{4}{3} \log \log \Delta + 1$.

Exercise 5.6.18 Prove that the number of iterations performed by Protocol Filter until there are no more than n elements left under consideration is at most 2.41 log(N/n).

Exercise 5.6.19 Prove that in the execution of Protocol REDUCE, Local Contraction is executed at most three times.

Exercise 5.6.20 Prove that after the execution of Cutting Tool on $C(l = 2^i)$, only the l − 1 columns C(1), C(2), . . . , C(l − 1) might remain unchanged; all others, including C(l), will have at least n − K/l of the entries equal to +∞.

Exercise 5.6.21 Prove that after the execution of Protocol CUT there will be at most min{n, Δ} log Δ items left under consideration.

Exercise 5.6.22 Consider the system shown in Figure 5.9. How many items will $x_5$ have (a) after a compacted sorting with w = 5? (b) after an equidistributed sorting? Justify your answer.

Exercise 5.6.23 Prove that OddEven-LineSort performs an invariant-sized sort of an equidistribution on an ordered line.

Exercise 5.6.24 () Prove that OddEven-LineSort performs an invariant-sized sort of any distribution on an ordered line.

Exercise 5.6.25 () Prove that OddEven-LineSort performs a compacted sort of any distribution on an ordered line.


Exercise 5.6.26 () Prove that OddEven-LineSort performs an equidistributed sort of any distribution on an ordered line.

Exercise 5.6.27 Prove that OddEven-LineSort sorts an equidistributed distribution in n − 1 iterations regardless of whether the required sorting is invariant-sized, equidistributed, or compacted with all entities having the same capacity.

Exercise 5.6.28 Prove that there are some initial conditions under which protocol OddEven-LineSort uses N − 1 iterations to perform invariant-sized sorting of N items distributed on a sorted line, regardless of the number n of entities.

Exercise 5.6.29 Consider an initial equidistribution sorted according to permutation π = π(n), π(n − 1), . . . , π(1). Prove that, executing protocol OddEven-LineSort in this case, every data item will change location in each iteration.

Exercise 5.6.30 Prove that when n > 3, if the line is not sorted according to π, then protocol OddEven-LineSort terminates but does not sort the data according to π.

Exercise 5.6.31 Write the set of rules of protocol OddEven-MergeSort. Implement the protocol and thoroughly test it.

Exercise 5.6.32 Prove that protocol OddEven-MergeSort is a sequence of 1 + log n iterations and that in each iteration (except the last) every data item is sent once or twice to another entity.

Exercise 5.6.33 Prove that protocol OddEven-MergeSort correctly sorts, regardless of the storage requirement, if the initial set is equidistributed.

Exercise 5.6.34 Consider an initial distribution where $x_1$ and $x_n$ have the same number K = (N − n + 2)/2 of data items, while all other entities have just a single data item. Augment protocol OddEven-MergeSort so as to perform an invariant sort when π = 1, 2, . . . , n. Show the corresponding sorting diagram. How many additional simple merge operations are needed? How many operations does your solution perform? Determine the time and message costs of your solution.
Exercise 5.6.35 For each of the three storage requirements (invariant, equidistributed, compacted) show a situation where Ω(N) messages need to be sent to sort in a complete network, even when the data are initially equidistributed.

Exercise 5.6.36 Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in a ring. What would be the bound for initially equidistributed sets?


Exercise 5.6.37 () Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in a labeled hypercube. What would be the bound for initially equidistributed sets?

Exercise 5.6.38 () Determine for each of the three storage requirements (invariant, equidistributed, compacted) a lower bound, in terms of n and N, on the number of messages necessary for sorting in an oriented torus. What would be the bound for initially equidistributed sets?

Exercise 5.6.39 Show how $x_{π(i)}$ can find out $k_i$ at the beginning of the ith iteration of strategy SelectSort. Initially, each entity knows only its index in the permutation (i.e., $x_{π(i)}$ knows i) as well as the storage requirements.

Exercise 5.6.40 Write the set of rules of Protocol SelectSort. Implement and test the protocol. Compare the experimental costs with the theoretical bounds.

Exercise 5.6.41 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in an ordered line. Determine under what conditions the protocol is optimal for this network. Compare this cost with the one of protocol OddEven-LineSort.

Exercise 5.6.42 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in an ordered line. Determine under what conditions the protocol is optimal for this network. Compare this cost with the one of protocol OddEven-LineSort.

Exercise 5.6.43 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in a ring. Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.36).

Exercise 5.6.44 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in a ring.
Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.36). Exercise 5.6.45 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in a labeled hypercube of dimension d. Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.37). Exercise 5.6.46 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in a labeled hypercube of dimension d. Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.37).


Exercise 5.6.47 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort an equidistributed set in an oriented torus of dimension p × q. Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.38).

Exercise 5.6.48 Establish for each of the storage requirements the worst-case cost of protocol SelectSort to sort a distributed set in an oriented torus of dimension p × q. Determine under what conditions the protocol is optimal for this network (Hint: Use result of Exercise 5.6.38).

Exercise 5.6.49 Show how in strategy DynamicSelectSort the coordinator x can determine π from the received information in $O(n^3)$ local processing activities.

Exercise 5.6.50 Write the set of rules of Protocol DynamicSelectSorting. Implement and test the protocol. Compare the experimental costs with the theoretical bounds.

Exercise 5.6.51 Prove that the query (D1 − D2) ∩ (D3 − (D4 ∩ D5)) can be answered immediately at both $x_1$ and $x_3$ if each of the sets is stored by its entity using the DSP method.

Exercise 5.6.52 Show that expressions 5.38 and 5.38 are equal.

Exercise 5.6.53 Prove that using strategy Bitmask, entity $x_i$ can directly evaluate any expression in $E^{-}(x_i)$.

Exercise 5.6.54 () Prove Property 5.4.5: Any expression of E(x) can be re-expressed as the union of sub-expressions in $E^{-}(x_i)$.

Exercise 5.6.55 () Prove Property 5.4.6.

5.6.2 Problems

Problem 5.6.1 () Design a generic protocol to perform selection in a small set using $o(n^2)$ messages in the worst case.

5.6.3 Answers to Exercises

Partial Answer to Exercise 5.6.4. Among the $2^{p−1}$ elements removed from consideration, exactly $2^{p−2}$ are greater than the median while exactly $2^{p−2}$ are smaller than the median.

Answer to Exercise 5.6.13. Without loss of generality, let K ≤ N − K + 1. Then, for the first N − 2K + 2 iterations, the adversary will choose d(i) to be the largest item in the search space.
In this way, only d(i) will be removed from the search space in that iteration;


furthermore, we still have K(i + 1) ≤ N(i + 1) − K(i + 1) + 1, where K(i) and N(i) are the rank of d* and the size of the search space at the beginning of iteration i. As in these iterations we are removing only elements larger than d*, after the N − 2K + 1 iterations d* is the median of the search space. At this point, the adversary will alternate selecting d(i) to be the smallest item in the search space in one iteration and the largest item in the next one. In this way, only d(i) will be removed and d* continues to be the (lower) median of the search space. Hence, the additional number of iterations is exactly 2K − 2, for a total of N iterations.

Partial Answer to Exercise 5.6.18. Show that at least 1/4 of the items are removed from consideration at each iteration.

Partial Answer to Exercise 5.6.19. Let K(j) and N(j) be the rank of f* in the search space and the size of the search space at the end of iteration j of the while loop in Protocol REDUCE. Call an iteration a flip if Δ(j) = N(j − 1) − Δ(j − 1) + 1 < Δ(j − 1). First of all observe that if the (j + 1)th iteration is not a flip, then it is the last iteration. Let the (j + 1)th iteration be a flip, and let q(j + 1) be the number of entities whose local search space is reduced in this iteration; q(j + 1) must be at least 1, otherwise the iteration would not be a flip. We will show that q(j + 1) = 1. By contradiction, if q(j + 1) > 1, there must be at least two entities x and y that will have their search space reduced in iteration (j + 1). That is, N(x, j) > Δ(j) and N(y, j) > Δ(j), where N(x, j) and N(y, j) denote the number of items still under consideration at x and y, respectively, at the end of the jth iteration. Then N(j) ≥ N(x, j) + N(y, j) ≥ 2Δ(j). This means that N(j) − Δ(j) + 1 > Δ(j), which implies that Δ(j + 1) = min{Δ(j), N(j) − Δ(j) + 1} = Δ(j), contradicting the fact that iteration (j + 1) is a flip.
Hence, q(j + 1) = 1; that is, if iteration (j + 1) is a flip, only one entity will reduce its search space in that iteration. To complete the proof, we must prove that the jth and the (j + 1)th iterations cannot both be flips.

Answer to Exercise 5.6.22. (a) none; (b) one.

Answer to Exercise 5.6.28. Consider the initial condition where the initial distribution is sorted according to n, n − 1, . . . , 1. Let $x_1$ and $x_n$ each contain (N − n + 2)/2 items, while all other entities have only one item each. Then trivially, in each odd iteration only one item can leave $x_1$. Hence, the last item to move from $x_1$ to $x_n$ will leave $x_1$ in the ((N − n + 2)/2)th odd iteration, which is the (N − n + 1)th iteration overall; this item reaches $x_n$ after an additional n − 2 iterations. Hence, the claimed N − 1 total number of iterations before termination.

Answer to Exercise 5.6.30. Without loss of generality let π = 1, 2, . . . , n. If the line is not sorted according to π, then there is an entity $x_i$ whose neighbors in the line, y and z, have indices


“greater” (respectively “smaller”) than it, that is, $y = x_j$ and $z = x_k$ where both j and k are greater (respectively, smaller) than i. Without loss of generality let j > k (respectively, j < k); that is, once sorted, the data stored in y must be greater (respectively, smaller) than the data stored in z. Among the data initially stored at z, include the largest data item D[N] (respectively, the smallest item D[1]). For the data to be sorted, this item must move from $z = x_k$ to $y = x_j$, passing through $x_i$. However, as k > i (respectively, k < i), according to the protocol z will never send D[N] (respectively, D[1]) to $x_i$.

Answer to Exercise 5.6.39. If the storage requirement is invariant sized, then $k_i = |D_{π(i)}|$, which is known to $x_{π(i)}$. If the requirement is equidistributed, then the entities need to know N/n; both n and N, if not already known, can be easily acquired (e.g., using saturation on a spanning tree). Then, $k_i = N/n$ for 1 ≤ i ≤ n − 1. If the storage requirement is compacted with parameter w, then $k_i = w$ for 1 ≤ i ≤ N/w, while $k_i = 0$ for i > N/w. Again, knowing N allows each entity to know the size of its final set of data items.

Answer to Exercise 5.6.49. Observe that if π(j) = k, then to transfer to $x_k$ all the data items that must end up there requires the transmission of

$$\beta_{j \to k} = \sum_{i=1}^{n} |D_{i,j}| \, d_G(x_i, x_k)$$

messages. Define variables $z_{j,k}$ to be equal to 1 if π(j) = k, and 0 otherwise. Then minimization of expression 5.28 reduces to finding a 0−1 solution for the linear programming assignment problem:

$$\text{Minimize } g[Z] = \sum_{j=1}^{n} \sum_{k=1}^{n} \beta_{j \to k} \, z_{j,k}$$

subject to

$$\sum_{k=1}^{n} z_{j,k} = 1 \quad (1 \le j \le n), \qquad \sum_{j=1}^{n} z_{j,k} = 1 \quad (1 \le k \le n), \qquad z_{j,k} \ge 0 \quad (1 \le j, k \le n).$$

A single entity can solve this problem in $O(n^3)$ local processing activities once the $\beta_{j \to k}$'s are available at that entity.
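The assignment problem in the answer to Exercise 5.6.49 can be illustrated on a tiny instance. The sketch below is not the O(n^3) method: for brevity it enumerates all permutations, which is only practical for very small n (a polynomial method such as the Hungarian algorithm would be used in practice). The function names, the 0-based indices, and the sample data are assumptions made for illustration; `sizes[i][j]` plays the role of $|D_{i,j}|$ and `dist[i][k]` the role of $d_G(x_i, x_k)$.

```python
from itertools import permutations


def transfer_costs(sizes, dist):
    """beta[j][k]: messages needed to move block j to entity k.

    sizes[i][j] is the number of items of block j held by entity i;
    dist[i][k] is the distance between entities i and k.
    """
    n = len(dist)
    return [[sum(sizes[i][j] * dist[i][k] for i in range(n))
             for k in range(n)]
            for j in range(n)]


def best_assignment(beta):
    """Exhaustive search for the permutation pi minimizing sum_j beta[j][pi[j]]."""
    n = len(beta)

    def cost(pi):
        return sum(beta[j][pi[j]] for j in range(n))

    pi = min(permutations(range(n)), key=cost)
    return pi, cost(pi)


# Hypothetical instance: a line of three entities, dist[i][k] = |i - k|.
dist = [[abs(i - k) for k in range(3)] for i in range(3)]
# Entity 0 holds the two items of block 2, entity 1 those of block 1,
# and entity 2 those of block 0.
sizes = [[0, 0, 2],
         [0, 2, 0],
         [2, 0, 0]]

beta = transfer_costs(sizes, dist)
pi, total = best_assignment(beta)
assert beta[0] == [4, 2, 0]
assert pi == (2, 1, 0) and total == 0
```

In this instance every block already sits at a single entity, so the cheapest permutation assigns each block to the entity that holds it and no message is needed.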

BIBLIOGRAPHY

[1] K.E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computer Conference, pages 307–314, 1968.
[2] T.-Y. Cheung. An algorithm with decentralized control for sorting files in a network. Journal of Parallel and Distributed Computing, 7(3):464–481, 1989.
[3] F. Chin and H.F. Ting. An improved algorithm for finding the median distributively. Algorithmica, 2:235–249, 1987.
[4] G.N. Frederickson. Distributed algorithms for selection in sets. Journal of Computer and System Sciences, 37(3):337–348, 1988.
[5] O. Gerstel, Y. Mansour, and S. Zaks. Bit complexity of order statistics on a distributed star network. Information Processing Letters, 30(3):127–132, 1989.
[6] O. Gerstel and S. Zaks. The bit complexity of distributed sorting. Algorithmica, 18:405–416, 1997.
[7] H.P. Hofstee, A.J. Martin, and J.L.A. van de Snepscheut. Distributed sorting. Science of Computer Programming, 15(2–3):119–133, 1990.
[8] M.C. Loui. The complexity of sorting on distributed systems. Information and Control, 60:70–85, 1984.
[9] W.S. Luk and F. Ling. An analytical/empirical study of distributed sorting on a local area network. IEEE Transactions on Software Engineering, 15(5):575–586, 1989.
[10] S.L. Mantzaris. An improved algorithm for finding the median distributively. Algorithmica, 10(6):501–504, 1993.
[11] J.M. Marberg and E. Gafni. Distributed sorting algorithms for multi-channel broadcast networks. Theoretical Computer Science, 52(3):193–203, 1987.
[12] A. Negro, N. Santoro, and J. Urrutia. Efficient distributed selection with bounded messages. IEEE Transactions on Parallel and Distributed Systems, 8:397–401, 1997.
[13] E.J. Otoo, N. Santoro, and D. Rotem. Improving semi-join evaluation in distributed query processing. In 7th International Conference on Distributed Computing Systems, pages 554–561, September 1987.
[14] M. Rodeh. Finding the median distributively. Journal of Computer and System Sciences, 24(2):162–167, 1982.
[15] D. Rotem, N. Santoro, and J.B. Sidney. Distributed sorting. IEEE Transactions on Computers, 34:372–376, 1985.
[16] D. Rotem, N. Santoro, and J.B. Sidney. Shout-echo selection in distributed files. Networks, 16:77–86, 1986.
[17] N. Santoro, M. Scheutzow, and J.B. Sidney. On the expected complexity of distributed selection. Journal of Parallel and Distributed Computing, 5:194–203, 1988.
[18] N. Santoro and J.B. Sidney. Order statistics on distributed sets. In 20th Allerton Conf. on Communication, Control and Computing, pages 251–256, 1982.
[19] N. Santoro and E. Suen. Reduction techniques for selection in a distributed file. IEEE Transactions on Computers, 38(6):891–896, 1989.
[20] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 14(4):361–372, 1992.
[21] L. Shrira, N. Francez, and M. Rodeh. Distributed k-selection: From a sequential to a distributed algorithm. In 2nd ACM Symposium on Principles of Distributed Computing, pages 143–153, 1983.
[22] L.M. Wegner. Sorting a distributed file in a network. Computer Networks, 8(5/6):451–462, December 1984.
[23] S. Zaks. Optimal distributed algorithms for sorting and ranking. IEEE Transactions on Computers, 34:376–380, 1985.

CHAPTER 6

Synchronous Computations

6.1 SYNCHRONOUS DISTRIBUTED COMPUTING

6.1.1 Fully Synchronous Systems

In the distributed computing environments we have considered so far, we have not made any assumption about time. In fact, from the model, we know only that in the absence of failures, a message transmitted by an entity will eventually arrive at its neighbor: the Finite Delays axiom. Nothing else is specified, so we do not know, for example, how much time a communication will take. In our environment, each entity is endowed with a local clock; still, no assumption is made on the functioning of these clocks, their rate, and how they relate to each other or to communication delays. For these reasons, the distributed computing environments described by the basic model are commonly referred to as fully asynchronous systems. They represent one extreme in the spectrum of message-passing systems with respect to time.

As soon as we add temporal restrictions, making assumptions on the local clocks and/or communication delays, we describe different systems within this spectrum. At the other extreme are fully synchronous systems, distributed computing environments where there are strong assumptions both on the local clocks and on communication delays. These systems are defined by the following two restrictions about time: Synchronized Clocks and Bounded Transmission Delays.

Restriction 6.1.1 Synchronized Clocks
All local clocks are incremented by one unit simultaneously.

In other words, all local clocks ‘tick’ simultaneously. Notice that this assumption does not mean that the clocks have the same value, but just that their value is incremented at the same time. Further notice that the interval of time between consecutive increments in general need not be constant. For simplicity, in the following we will assume that this interval is constant and denote it by δ; see Figure 6.1.

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.


FIGURE 6.1: In a fully synchronous system, all clocks tick periodically and simultaneously, and there is a known upper bound Δ on communication delays.

By convention,
1. entities will transmit messages (if needed) to their neighbors only at the strike of a clock tick;
2. at each clock tick, an entity will send at most one message to the same neighbor.

Restriction 6.1.2 Bounded Communication Delays
There exists a known upper bound on the communication delays experienced by a message in absence of failures.

In other words, there is a constant Δ such that in absence of failures, every message sent at time T will arrive and be processed by time T + Δ. In terms of clock ticks, this means that in absence of failures, every message sent at local clock tick t will arrive and be processed by clock tick $t + \lceil \Delta/\delta \rceil$ (sender's time); see Figure 6.1. Summarizing, a fully synchronous system is a distributed computing environment where both the above restrictions hold. Notice that knowledge of Δ can be replaced by knowledge of $\lceil \Delta/\delta \rceil$.

6.1.2 Clocks and Unit of Time

In a fully synchronous system, two consecutive clock ticks constitute a unit of time, and we measure the time costs of a computation in terms of the number of clock ticks elapsed from the time the first entity starts the computation to the time the last entity terminates its participation in the computation. Notice that, behind this “clock time,” there is an underlying notion of “real time” (or physical time), one that exists outside the system (and independent of it), in terms of which we express the distance δ between clock ticks as well as the bound Δ on communication delays. We can redefine the unit of time to be composed of u > 1 consecutive clock ticks. In other words, we can define new clock ticks, each comprising u old ones, and act accordingly. In particular, each entity will only send messages at the beginning of


FIGURE 6.2: Redeﬁne the clock ticks so that the delays are unitary.

a new time unit and does not send more than one message to the same neighbor in each new time unit. Clearly, the entities must agree on when the new time unit starts. After the transformation, we can still measure time costs of a computation correctly: If the execution of a protocol lasts K new time units, its time cost is uK original clock ticks. Observe that if we choose $u = \lceil \Delta/\delta \rceil$ (Figure 6.2), then with the new clocks communication delays become unitary: If an entity x sends a message at the (new) local clock tick t to a neighbor, in absence of failures, the message is received and processed there at the (new) clock tick t + 1 (sender's time). In other words, any fully synchronous system can be transformed so as to have unitary delays.

This means that we can assume, without loss of generality, that the following restriction holds:

Restriction 6.1.3 Unitary Communication Delays
In absence of failures, a transmitted message will arrive and be processed after at most one clock tick.

The main advantage of redefining the unit of time in this way is that it greatly simplifies the design and analysis of protocols for fully synchronous systems. In fact, it is common to find fully synchronous systems defined directly as having unitary delays.

IMPORTANT. In the following, the pair of Restrictions 6.1.1 and 6.1.3, defining a fully synchronous system with unitary delay, will be denoted simply by Synch.
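The redefinition of the time unit can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, new time units are assumed to start at original ticks 0, u, 2u, and so on, and u is taken as Δ/δ rounded up so that a new unit is a whole number of original ticks.

```python
import math


def ticks_per_unit(max_delay, tick_interval):
    """Number u of original ticks per new time unit (Δ/δ rounded up)."""
    return math.ceil(max_delay / tick_interval)


def cost_in_original_ticks(new_units, u):
    """A protocol lasting K new time units costs u*K original clock ticks."""
    return u * new_units


def next_send_tick(t, u):
    """First original tick >= t at which a new time unit begins.

    Under the convention, messages are sent only at such ticks.
    """
    remainder = t % u
    return t if remainder == 0 else t + (u - remainder)


u = ticks_per_unit(max_delay=7, tick_interval=2)  # Δ = 7, δ = 2
assert u == 4
assert cost_in_original_ticks(5, u) == 20
assert next_send_tick(9, u) == 12
assert next_send_tick(8, u) == 8
```

With these sample values a message sent at original tick 8 is delivered by real time 16 + 7 = 23, which falls before original tick 12 (real time 24), the start of the next new unit; this is the unitary-delay behavior described above.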


6.1.3 Communication Delays and Size of Messages

A fully synchronous system, by definition, guarantees that, in absence of failures, any allowed message will encounter bounded delays. More precisely, by definition, for any message M, the communication delay τ(M) encountered by M in absence of failures will always be

$$\tau(M) \le \Delta. \qquad (6.1)$$

Notice that this must hold regardless of the size (i.e., the number of bits) of M. Let us examine this fact carefully. By Restriction 6.1.2, Δ is bounded. For Δ to be bounded, τ(M) must be bounded. This fact implies that the size of M must be bounded: To assume otherwise means that the system allows communication of unbounded messages in bounded time, an impossibility. This means,

Property 6.1.1 Bounded messages
In fully synchronous systems, messages have bounded length.

In other words, there exists a constant c (depending on the system) such that each message will contain at most c bits. Bounded messages are also called packets and the constant c is called packet size.

IMPORTANT. The packet size c is a system parameter. It could be related to other system parameters such as n (the network size) or m (the number of links). However, it cannot depend on input values (unless they are also bounded).

The bounded messages property has important practical consequences. It implies that if the information an entity x must transmit does not fit in a packet, that information must be “split up” and transmitted using several packets. More precisely, the transmission of w > c bits to a neighbor actually requires the transmission of M[w] messages where $M[w] \ge \lceil w/c \rceil$. This fact affects not only the message costs but also the time costs. As at most one message can be sent to a neighbor at a given clock tick, the number of clock ticks required by the transmission of w > c bits is $CT[w] \ge \lceil w/c \rceil$.

6.1.4 On the Unique Nature of Synchronous Computations

Fully synchronous computing environments are dramatically different from the asynchronous ones we have considered so far. The difference is radical (that is, it goes to the roots) and provides

the protocol designer working in a fully synchronous environment with computational means and tools that are both unique and very powerful. In the following we will briefly describe two situations providing an insight into the unique nature of synchronous computations.

Overcoming Lower Bounds: Different Speeds

As a first example of a synchronous algorithm, we will discuss a protocol for leader election in synchronous rings. We assume the standard restrictions for elections (IR), as well as Synch; the goal is to elect as leader the candidate with the smallest value. The protocol is essentially AsFar with an interesting new idea. Recall that in AsFar each entity originates a message with its own id, forwards only messages with the smallest id seen so far, and trashes all the other incoming messages. The message with the smallest value will never be trashed; hence it will make a full tour of the ring and return to its originator; every other message will be trashed by the first entity with a smaller id it encounters. We have seen that this protocol has an optimal message complexity on the average but uses $O(n^2)$ messages in the worst case.

The interesting new idea is to have each message travel along the ring at a different speed, determined by the id it contains, so that messages with smaller ids travel faster than those with larger values. In this way, a message with a small id can “catch up” with a slower message (containing a larger id); when this happens, the message with the larger id will be trashed. In other words, a message with a large id is trashed not only if it reaches an entity aware of a smaller id but also if it is reached by a message with a smaller id. However, in a synchronous system, every message transmission will take at most one time unit; so, in a sense, all messages travel at the same speed. How can we implement variable speeds in a synchronous system?
The answer is simple:

(a) When an entity x receives a message with a value i smaller than any seen so far by x, instead of immediately forwarding the message along the ring (as the protocol AsFar would require), x will hold this message for an amount of time (i.e., a number of clock ticks) f(i) that increases with the value i.
(b) If a message with a smaller value arrives at x during this time, x will remove i from consideration and process the new value. Otherwise, after holding i for f(i) clock ticks, x will forward it along the ring.

The effect is that a message with value i will effectively advance along the ring by one link every 1 + f(i) time units: if originally sent at time 0, it will be sent at time 1 + f(i) to the next entity, and again at time 2 + 2f(i), 3 + 3f(i), and so on, until it is trashed or completes the tour of the ring. In this simple way, we have implemented both variable speeds and the “catch-up” of slow messages by faster ones!

The correctness of this new protocol follows from the fact that, again, the message with the smallest id will never be trashed and will thus return to its originator; every other message will be trashed either because it arrives at an entity that has seen a smaller id or because it is reached by a message with a smaller id.

To determine the cost of the protocol, called Speed, obviously we must take care of several implementation details (variables, bookkeeping, start, speed, etc.), but the basic mechanism is there. Let us assume for the moment that all entities are initially candidates and start at the same time. For every choice of the monotonically increasing speed function f we will obtain a different cost. In particular, by choosing $f(i) = 2^i$, we have a very interesting situation. In fact, by the time (the message with) the smallest id $i_1$ has traveled all along the ring causing n transmissions, the second smallest $i_2$ could have traveled at most halfway along the ring causing n/2 transmissions, the third smallest could have traveled at most n/4, and in general the j-th smallest could have traveled at most distance $n/2^{j-1}$. In other words, with this choice of speed function, the total number of transmissions until the entity with smallest value becomes leader is

$$\sum_{j=1}^{n} \frac{n}{2^{j-1}} < 2n.$$

As the protocol will just need an additional n messages for the final notification, we have

$$M[\text{Speed}] = O(n). \tag{6.2}$$
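The waiting mechanism can be tried out with a small event-driven simulation. The following is only a sketch under simplifying assumptions that are not part of the protocol description: the ring is unidirectional, ids are distinct nonnegative integers, all entities start at time 0, an arrival is processed before an alarm ringing at the same tick, and the final notification round is omitted. The function name and its return values are illustrative. Because the simulation counts every transmission, including the initial send of each entity, its totals are compared against a looser bound of 4n rather than against 2n.

```python
import heapq


def speed_election(ids, f=lambda i: 2 ** i):
    """Simulate the waiting mechanism of Speed on a unidirectional ring.

    ids[k] is the id of entity k, which sends to entity (k + 1) % n.
    Returns (leader_id, transmissions, elapsed_ticks).
    """
    n = len(ids)
    smallest = list(ids)      # smallest id seen so far by each entity
    alarm = [None] * n        # tick at which the held value must be forwarded
    events = []               # heap of (time, kind, entity, value)
    ARRIVAL, ALARM = 0, 1     # at equal times, arrivals are handled first
    sent = 0

    def send(k, value, now):
        nonlocal sent
        sent += 1
        # A message sent at time t is received at time t + 1.
        heapq.heappush(events, (now + 1, ARRIVAL, (k + 1) % n, value))

    for k in range(n):        # every entity originates its own id at time 0
        send(k, ids[k], 0)

    while events:
        now, kind, k, value = heapq.heappop(events)
        if kind == ARRIVAL:
            if value == ids[k]:
                return value, sent, now      # the id completed the tour
            if value < smallest[k]:
                smallest[k] = value          # drop any larger value being held
                alarm[k] = now + f(value)    # hold the new value for f(value) ticks
                heapq.heappush(events, (alarm[k], ALARM, k, value))
            # otherwise the incoming message is trashed
        elif alarm[k] == now and smallest[k] == value:
            alarm[k] = None                  # stale alarms fail this test
            send(k, value, now)
    raise RuntimeError("no leader elected")


leader, transmissions, ticks = speed_election([5, 3, 8, 1, 9, 2, 7])
print(leader, transmissions, ticks)
```

Keeping the ids small matters in practice, since the holding times grow as 2^id.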

This result is remarkable: this message complexity is lower than the $\Omega(n \log n)$ lower bound for leader election in asynchronous rings! It clearly shows a fundamental complexity difference between synchronous and asynchronous systems. To achieve this result, we have used time directly as a computational tool: to implement the variable speeds of the messages and to select the appropriate waiting function f.

The result must be further qualified; in fact, it is correct assuming that the entity values are small enough to fit into a packet. In other words, it is correct only if the input values are bounded by $2^c$; we will denote this additional restriction on the size of the input by InputSize($2^c$). To have a better understanding of the amount of transmissions, we can measure the number of bits:

$$B[\text{Speed}] = O(n \log i), \tag{6.3}$$

where i is the range of the input values.

We have assumed that all entities start at the same time. This assumption is not essential: it suffices that we first perform a wake-up, and elect a leader only among the spontaneous initiators (i.e., the others will not originate a message but will still actively participate in the trashing and waiting processes). The election messages themselves can act as “wake-up” messages, traveling at normal (i.e., unitary) speed until they reach the first spontaneous initiator, and only then traveling at the assigned speed. In this way, we still obtain an O(n) message complexity (Exercise 6.6.3).

PROTOCOL Speed

States: S = {ASLEEP, CANDIDATE, RELAYER, FOLLOWER, LEADER};
S_INIT = {ASLEEP};
S_TERM = {FOLLOWER, LEADER}.
Restrictions: RI ∪ Synch ∪ Ring ∪ InputSize($2^c$).

ASLEEP
    Spontaneously
    begin
        min := id(x);
        send("FindMin", min) to right;
        become CANDIDATE;
    end

    Receiving("FindMin", id*)
    begin
        min := id*;
        send("FindMin", min) to other;
        become RELAYER;
    end

CANDIDATE
    Receiving("FindMin", id*)
    begin
        if id* < min then
            PROCESS-MESSAGE;
            become RELAYER
        else
            if id* = id(x) then
                send(Notify) to other;
                become LEADER
            endif;
        endif
    end

    When(c(x) = alarm)
    begin
        send("FindMin", min) to direction;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

FIGURE 6.3: Protocol Speed.

RELAYER
    Receiving("FindMin", id*)
    begin
        if id* < min then
            PROCESS-MESSAGE;
        endif
    end

    When(c(x) = alarm)
    begin
        send("FindMin", min) to direction;
    end

    Receiving(Notify)
    begin
        send(Notify) to other;
        become FOLLOWER;
    end

Procedure PROCESS-MESSAGE
begin
    min := id*;
    direction := sender;
    set alarm := c(x) + f(id*);
end

FIGURE 6.4: Rule for Relayer and Procedure Process-Message used by protocol Speed.

The modified protocol Speed is shown in Figures 6.3 and 6.4; c(x) denotes the local clock of the entity x executing the protocol, and When denotes the external event of the alarm clock ringing.

Beyond the Scenes

The results expressed by Equations 6.2 and 6.3 do not tell the whole story. If we calculate the time consumed by protocol Speed we find (Exercise 6.6.4) that

$$T[\text{Speed}] = O(n 2^i). \tag{6.4}$$

In other words, the time is exponential. It is actually worse than it sounds. In fact, it is exponential not in n (a system parameter) but in the range i of the input values.

Overcoming Transmission Costs: 2-bit Communication

We have seen how, in a synchronous environment, the lower bounds established for asynchronous problems do not necessarily hold. This is because of the additional computational power of fully synchronous systems.


FIGURE 6.5: Entity x sends only two packets.

The clearest (and yet surprising) example of the difference between synchronous and asynchronous environments is the one we will discuss now. Consider an entity x that wants to communicate to a neighbor y some information, unknown to y. Recall that in a fully synchronous system messages are bounded: if I want to transmit w bits, I will have to send $\lceil w/c \rceil$ packets and therefore use at least $\lceil w/c \rceil$ time units or clock ticks. Still, x can communicate the information to y transmitting only two packets (!), regardless of the packet size (!!) and regardless of the information (!!!), provided it is finite.

Property 6.1.2 In absence of failures, any finite sequence of bits can be communicated transmitting two messages, regardless of the message size.

Let us see how this extraordinary result is possible. Let α be the sequence of bits that x wants to communicate to y; let 1α be the sequence α prefixed by the bit 1 (e.g., if α = 011, then 1α = 1011). Let I(1α) denote the integer whose binary encoding is 1α; for example, I(1011) = 11. Consider now the following protocol:

PROTOCOL TwoBits.
1. Entity x (see Figure 6.5):
   (a) it sends to y a message “Start-Counting”;
   (b) it waits for I(1α) clock ticks, and then
   (c) sends a message “Stop-Counting”.
2. Entity y (Figure 6.6):
   (a) upon receiving the “Start-Counting” message, it records the current value $c_1$ of the local clock;
   (b) upon receiving the “Stop-Counting” message, it records the current value $c_2$ of the local clock.

Clearly $c_2 - c_1 = I(1\alpha)$, from which α can be reconstructed. As the message size is irrelevant and the string 1α is finite but arbitrary, the property states that in absence of failures, any finite amount of information can be communicated by transmitting just 2 bits!
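Protocol TwoBits can be mimicked in a few lines. This is a minimal sketch, assuming a fault-free link with a fixed one-tick delivery delay; the function names and the `start` and `latency` parameters are illustrative, not part of the protocol.

```python
def encode(alpha: str) -> int:
    """Return I(1α): the integer whose binary encoding is α prefixed by 1."""
    return int("1" + alpha, 2)


def decode(delay: int) -> str:
    """Recover α from the measured delay by dropping '0b' and the leading 1."""
    return bin(delay)[3:]


def two_bits(alpha: str, start: int = 0, latency: int = 1) -> str:
    """Simulate the exchange and return what the receiver reconstructs."""
    # Sender: "Start-Counting" at `start`, "Stop-Counting" I(1α) ticks later.
    send_times = (start, start + encode(alpha))
    # Receiver: records its clock at the two arrivals.
    c1, c2 = (t + latency for t in send_times)
    return decode(c2 - c1)


print(encode("011"), two_bits("011"))
```

The leading 1 is what preserves leading zeros of α: without it, the sequences 011 and 11 would produce the same delay.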


FIGURE 6.6: Entity y can reconstruct the information.

IMPORTANT. In synchronous computing there is a difference between communication and transmission. In fact, unlike asynchronous systems where transmission of messages is the only way in which neighboring entities can communicate, in synchronous systems absence of transmission can be used to communicate information, as we have just seen. In other words, in synchronous systems silence is expressive. This is the radical difference between synchronous and asynchronous computing environments. We will investigate how to exploit it in our designs.

Beyond the Scenes

The property, as stated, is incomplete from a complexity point of view. In fact, in a synchronous system, time and transmission complexities are intrinsically related to a degree nonexistent in asynchronous systems. In the example above, the constant bit complexity is achieved at the cost of a time complexity that is exponential in the length of the sequence of bits to be communicated. In fact, x has to wait I(1α) time units, but

$$2^{|\alpha|} \le I(1\alpha) \le 2^{|\alpha|+1} - 1,$$

where |α| denotes the size (i.e., the number of bits) of α. Once again, there is an exponential time cost to be paid for the remarkable use of time.

6.1.5 The Cost of Synchronous Protocols

In a fully synchronous system, time and transmission complexities are intrinsically related to a degree nonexistent in asynchronous systems. As we have discussed in the subsection “Beyond the Scenes” of Section 6.1.4, to say “we can solve the election in a ring with O(n) messages” or “we can communicate the Encyclopædia Britannica transmitting 2 bits” is correct but incomplete. We have been able to achieve those results because we have used time as a computational element; however, time must be charged, and the protocol must pay for it.


In other words, the cost of a fully synchronous protocol is both time and transmissions. More precisely, the communication cost of a fully synchronous protocol P is a couple ⟨P, T⟩, where P denotes the number of packets and T denotes the number of time units. We will more often use the number of bits B instead of P; thus, our common measure will be the couple

$$\text{Cost}[P] = \langle B[P], T[P] \rangle.$$

So, for example, the complexity of Protocol Speed is

$$\text{Cost}[\text{Speed}(i)] = \langle O(n \log i), O(n 2^i) \rangle$$

and that of Protocol TwoBits is

$$\text{Cost}[\text{TwoBits}(\alpha)] = \langle 2, O(2^{|\alpha|}) \rangle.$$

Summarizing, the cost of a fully synchronous protocol is both time and bits. In general, we can trade off one for the other, transmitting more bits to use less time, or vice versa, depending on our design goals.

6.2 COMMUNICATORS, PIPELINE, AND TRANSFORMERS

In a system of communicating entities, the most basic and fundamental problem is obviously the process of an entity, the sender, efficiently and accurately communicating information to another entity, the receiver. If these two entities are neighbors, this problem is called the Two-Party Communication (TPC) problem. In an asynchronous system, this problem has only one solution: the sender puts the information into messages and transmits those messages. In fully synchronous systems, as we have already observed, transmission of bits is not the only way of communicating information; for example, in a fault-free system, if no bit is received at local time t + 1, then none was transmitted at time t. Hence, absence of transmission, or silence, is detectable and can be used to convey information. In fact, there are many possible solutions to the Two-Party Communication problem, called communicators, each with different costs. We have already seen one, Protocol TwoBits.

In this section we will examine the design of efficient communicators. Owing to the basic nature of the process, the choice of a communicator will greatly affect the overall performance of the higher level protocols employed in the system.

We will then discuss the problem of communicating information at a distance, that is, when the sender and the receiver are not neighbors. We will see how this and related problems can be efficiently solved using a technique well known in very large scale integration (VLSI) and parallel systems: pipeline.

We will also examine the notion of asynchronous-to-synchronous transformer, a “compiler” that, given in input an asynchronous protocol solving a problem P, generates an efficient synchronous protocol solving P. Such a transformer is a useful tool to solve problems for which an asynchronous solution is already known. Communicators are an essential component of a transformer; in fact, as we will see, different communicators result in different costs for the generated synchronous protocol. This is one more reason to focus on the design of efficient communicators.

In the following, we will assume that no failure will occur, that is, we operate under restriction Total Reliability.

6.2.1 Two-Party Communication

Consider the simple task of an entity, the sender, communicating information to a neighbor, the receiver. At each time unit, the sender can either transmit a packet or remain silent; a packet transmitted by the sender at time t will be received and processed by the receiver at time t + 1 (sender’s time). The interval of time between two successive transmissions by the sender is called a quantum of silence (or, simply, quantum); if there are no failures, the interval of time between the two arrivals will be the same for the receiver (see Figure 6.7). The quantum is zero if the packets are sent at two consecutive clock ticks.

FIGURE 6.7: For the sender, a quantum is the number of clock ticks between two successive transmissions; for the receiver, it is the interval between two successive arrivals.

Thus, to communicate information, the sender can use not only the transmission of several packets, but also the quanta of silence between successive transmissions. For example, in the TwoBits protocol, the sender was using the transmission of two packets as well as the quantum of silence between them. In general, the transmission of k packets $p_0, p_1, \ldots, p_{k-1}$ defines k − 1 quanta $q_1, q_2, \ldots, q_{k-1}$, where $q_i$ is the interval between the transmissions of $p_{i-1}$ and $p_i$, 1 ≤ i ≤ k − 1. The ordered sequence

$$p_0 : q_1 : p_1 : \ldots : q_{k-1} : p_{k-1}$$

is called the communication sequence.
Clearly, there are many different ways in which we can design a protocol for the two entities to communicate using transmissions and silence, depending on the value of k we choose, the content of the packets, the size c of the packets, and so forth. Each design will yield a different cost.


The problem of performing this task is called the Two-Party Communication problem, and any solution protocol is called a communicator. A communicator must specify the operations of the sender and of the receiver. In particular, a communicator is composed of an encoding function, specifying how to encode the information into the communication sequence of packets and silence, and a decoding function, specifying how to reconstruct the information from the communication sequence of packets and silence.

Associated with any communicator are clearly two related cost measures: the total number of packets transmitted and the total number of clock ticks elapsed during the communication; as we will see, the study of the two-party communication problem in synchronous networks is really the study of the trade-off between time and transmissions.

IMPORTANT. To simplify the discussion, in the following, we will consider that a packet contains just a single bit, that is, c = 1. Everything we will say is easily extendable to the case c > 1.

2-bit Communicators

We have already seen the most well known communicator, Protocol TwoBits. This protocol, also known as $C_2$, belongs to a class of communicators called k-bit Communicators where the number of transmitted packets is a constant k fixed a priori and known to both entities. In $C_2$, to communicate a positive integer i, the sender transmits two packets, $b_0$ and $b_1$, waiting i time units between the two transmissions; the receiver computes the quantum of silence $q_1$ between the two transmissions and decodes it as the information. In other words, the communication pattern is $b_0 : q_1 : b_1$. The encoding function is

$$\text{encode}(i) = b_0 : i : b_1$$

and the decoding function is

$$\text{decode}(b_0 : q_1 : b_1) = q_1.$$

Thus, the total amount of time from the time the sender starts the first transmission to the time the receiver decodes the information is the quantum of silence plus the two time units used for transmitting the bits. Thus, the cost of the protocol is

$$\text{Cost}[C_2(i)] = \langle 2, i + 2 \rangle. \tag{6.5}$$
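The cost accounting used for $C_2$ can be written down once for any communication sequence. The sketch below assumes that a sequence is represented only by its list of quanta (packet contents are ignored) and that time is charged as one unit per packet plus the silent ticks, as in Equation 6.5; the function names are illustrative.

```python
from typing import List, Tuple


def cost(quanta: List[int]) -> Tuple[int, int]:
    """Return <packets, time> for a sequence delimiting the given quanta."""
    packets = len(quanta) + 1              # k quanta need k + 1 delimiting packets
    return packets, sum(quanta) + packets  # silence plus one tick per packet


def c2_encode(i: int) -> List[int]:
    """C2: the value itself is the single quantum of silence."""
    return [i]


def c2_decode(quanta: List[int]) -> int:
    (q1,) = quanta
    return q1


print(cost(c2_encode(7)))
```

Representing a communicator by its quanta keeps the time and packet counts comparable across different encodings.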


Hacking. We can improve the time complexity by exploiting the fact that the two transmitted bits $b_0$ and $b_1$ can be used to convey some information about i. In fact, it is possible to construct a communicator, called $R_2$, that communicates i transmitting 2 bits and only $2 + \lceil i/4 \rceil$ time units (Exercise 6.6.6). Clearly, a better time complexity will be obtained if packets contain more than a single bit; that is, c > 1 (Exercise 6.6.7).

3-bit Communicators

Let us examine what difference transmitting an extra packet makes to the overall cost of communication. First of all, observe that with three packets $b_0$, $b_1$, and $b_2$, we have two quanta of silence: the interval of time $q_1$ between the transmission of $b_0$ and $b_1$ and the interval $q_2$ between the transmission of $b_1$ and $b_2$. In other words, the communication pattern is $b_0 : q_1 : b_1 : q_2 : b_2$.

With this extra quantum at our disposal, consider the following strategy. If the sender could communicate $\sqrt{i}$ using a single quantum, the receiver could reconstruct i by squaring the received quantum, and the entire process would still cost 2 bits (to delimit the quantum) but only $\sqrt{i} + 2$ time! The problem with this strategy is that $\sqrt{i}$ might not be an integer, while a quantum must be an integer. The sender can obviously use a quantum $q_1 = \lfloor \sqrt{i} \rfloor$, which is an integer, and the receiver can compute $q_1^2$, which, however, might be smaller than i. What the sender can do is to use the second quantum $q_2$ to communicate how far $q_1^2$ is from i, that is, $q_2 = i - q_1^2$. In this way, the receiver is able to reconstruct i: it simply computes $q_1^2 + q_2$. In other words, the encoding function is

$$\text{encode}(i) = b_0 : \lfloor \sqrt{i} \rfloor : b_1 : i - \lfloor \sqrt{i} \rfloor^2 : b_2.$$

For example, encode(8,425) = $b_0$ : 91 : $b_1$ : 144 : $b_2$. The decoding function is

$$\text{decode}(b_0 : q_1 : b_1 : q_2 : b_2) = q_1^2 + q_2.$$

The time required by this protocol is clearly $q_1 + q_2 + 3$; as $x - \lfloor \sqrt{x} \rfloor^2 \le 2 \lfloor \sqrt{x} \rfloor$, we have

$$q_1 + q_2 + 3 = \lfloor \sqrt{i} \rfloor + i - \lfloor \sqrt{i} \rfloor^2 + 3 \le 3 \lfloor \sqrt{i} \rfloor + 3.$$

In other words, this protocol, called $C_3$, has sublinear time complexity. The resulting cost is

$$\text{Cost}[C_3(i)] = \langle 3, 3 \lfloor \sqrt{i} \rfloor + 3 \rangle. \tag{6.6}$$
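The encoding and decoding of $C_3$ are easy to check numerically. The following is a small sketch in which only the two quanta are modelled (the three delimiting packets carry no information); the function names are illustrative.

```python
from math import isqrt


def c3_encode(i: int):
    """Return the two quanta (q1, q2) used by C3 to communicate i."""
    q1 = isqrt(i)               # floor of the square root of i
    return q1, i - q1 * q1      # q2 measures how far q1 squared is from i


def c3_decode(q1: int, q2: int) -> int:
    return q1 * q1 + q2


def c3_time(i: int) -> int:
    """Two quanta of silence plus three ticks for the three packets."""
    q1, q2 = c3_encode(i)
    return q1 + q2 + 3


print(c3_encode(8425), c3_time(8425))
```

For i = 8,425 the quanta are 91 and 144, matching the example in the text, and the elapsed time is 238 ticks against 8,427 for $C_2$.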


FIGURE 6.8: Constructing the encoding of 33,703 when k = 5.

Hacking. We can improve the time complexity by exploiting the fact that the transmitted packets can be used to convey some information about i. In fact, it is possible to construct a communicator, called $R_3$, that communicates i transmitting 3 bits and only $\sqrt{i} + 3$ time units (Exercise 6.6.8). Again, the more bits a packet contains, the better the time costs will be (Exercise 6.6.9).

($2^d + 1$)-bit Communicators

A solution protocol using $k = 2^d + 1$ bits can be easily obtained by extending the idea employed for $k = 2^1 + 1 = 3$. The encoding of i can be defined recursively as follows:

$$\text{encoding}(i) = b : E(I_1) : b$$

$$E(I_j) = \begin{cases} E(I_{2j}) : b : E(I_{2j+1}) & \text{if } 1 \le j < k - 1 \\ \text{quantum of length } I_j & \text{if } k - 1 \le j \le 2k - 3, \end{cases}$$

where $I_1 = i$, $I_{2j} = \lfloor \sqrt{I_j} \rfloor$, and $I_{2j+1} = I_j - I_{2j}^2$, and b is an arbitrary packet. So, for example, the encoding of i = 33,703 when k = 5 is b 13 b 14 b 14 b 18 b (see Figure 6.8). To obtain $i = I_1$, the receiver will recursively compute $I_j = I_{2j}^2 + I_{2j+1}$. Exactly k − 1 quanta will be used, and k bits will be transmitted. The time costs will be $O(i^{1/k})$ (Exercise 6.6.10).

Optimal (k+1)-bit Communicators (⋆)

When designing efficient communicators, several questions arise naturally: How good are the communicators we have designed so far? In general, if we use k + 1 transmissions, what is the best time that can be achieved and which communicator will be able to achieve it? In this section we will answer these questions. We will design a general class of solution protocols and analyze their cost; we will then establish lower bounds and show that the proposed protocols achieve these bounds and are therefore optimal.
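As a concrete check of the recursive ($2^d + 1$)-bit encoding defined earlier in this section, the sketch below computes the k − 1 quanta level by level instead of recursively. It assumes that k − 1 is a power of two and models only the quanta; the function names are illustrative.

```python
from math import isqrt


def encode(i: int, k: int):
    """Return the k - 1 quanta used to communicate i with k packets."""
    leaves = k - 1
    if leaves < 1 or leaves & (leaves - 1):
        raise ValueError("k - 1 must be a power of two")
    level = [i]                           # the root value I_1
    while len(level) < leaves:
        nxt = []
        for value in level:               # split every value into two children
            root = isqrt(value)           # I_2j = floor(sqrt(I_j))
            nxt += [root, value - root * root]  # I_2j+1 = I_j - I_2j squared
        level = nxt
    return level


def decode(quanta):
    """Rebuild i by repeatedly combining sibling pairs: I_j = I_2j^2 + I_2j+1."""
    level = list(quanta)
    while len(level) > 1:
        level = [level[j] ** 2 + level[j + 1] for j in range(0, len(level), 2)]
    return level[0]


print(encode(33703, 5), decode([13, 14, 14, 18]))
```

With k = 2 the sketch reduces to $C_2$ and with k = 3 to $C_3$; for i = 33,703 and k = 5 it yields the quanta 13, 14, 14, 18 of Figure 6.8.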


Our goal is now to design protocols that can communicate any positive integer I transmitting k + 1 packets and using as little time as possible. Observe that with k + 1 packets the communication sequence is

$$b_0 : q_1 : b_1 : q_2 : b_2 : \ldots : q_k : b_k.$$

We will first of all make a distinction between protocols that do not care about the content of the transmitted packets (like $C_2$ and $C_3$) and those (like $R_2$ and $R_3$) that use those packets to convey information about I.

The first class of protocols are able to tolerate the type of transmission failures called corruptions. In fact, they use packets only to delimit quanta; as it does not matter what the content of the packet is (but only that it is being transmitted), these protocols will work correctly even if the value of the bits in the packets is changed during transmission. We will call them corruption-tolerant communicators.

The second class exploits the content of the packets to convey information about I; hence, if the value of just one of the bits is changed during transmission, the entire communication will become corrupted. In other words, these communicators need reliable transmission for their correctness.

Clearly, the bounds and the optimal solution protocols are different for the two classes. We will consider the first class in detail; the second type of communicators will be briefly sketched at the end. As before, we will consider for simplicity the case when a packet is composed of a single bit, that is, c = 1; the results can be easily generalized to the case c > 1.

Corruption-Tolerant Communication

If transmissions are subject to corruptions, the value of the received packets cannot be relied upon, and so they are used only to delimit quanta. Hence, the only meaningful part of the communication sequence is the k-tuple of quanta $\langle q_1, q_2, \ldots, q_k \rangle$. Thus, the (infinite) set $Q_k$ of all possible k-tuples $\langle q_1, q_2, \ldots, q_k \rangle$, where the $q_i$ are nonnegative integers, describes all the possible communication sequences.

What we are going to do is to associate to each communication sequence $Q[I] \in Q_k$ a different integer I. Then, if we want to communicate I, we will use the unique sequence of quanta described by Q[I]. To achieve this goal we need a bijection between k-tuples and nonnegative integers. This is not difficult to do; it is sufficient to establish a total order among tuples as follows. Given two k-tuples $Q = \langle q_1, q_2, \ldots, q_k \rangle$ and $Q' = \langle q'_1, q'_2, \ldots, q'_k \rangle$ of nonnegative integers, we say that $Q < Q'$ if

1. $\sum_i q_i < \sum_i q'_i$, or
2. $\sum_i q_i = \sum_i q'_i$ and, for some index l with 1 ≤ l ≤ k, $q_j = q'_j$ for 1 ≤ j < l and $q_l < q'_l$.

 I   Q[I]      I   Q[I]      I   Q[I]
 0   0,0,0    11   0,1,2    23   0,3,1
 1   0,0,1    12   0,2,1    24   0,4,0
 2   0,1,0    13   0,3,0    25   1,0,3
 3   1,0,0    14   1,0,2    26   1,1,2
 4   0,0,2    15   1,1,1    27   1,2,1
 5   0,1,1    16   1,2,0    28   1,3,0
 6   0,2,0    17   2,0,1    29   2,0,2
 7   1,0,1    18   2,1,0    30   2,1,1
 8   1,1,0    19   3,0,0    31   2,2,0
 9   2,0,0    20   0,0,4    32   3,0,1
10   0,0,3    21   0,1,3    33   3,1,0
              22   0,2,2    34   4,0,0

FIGURE 6.9: The first 35 elements of $Q_3$ according to the total order.

That is, in this total order, all the tuples where the sum of the quanta is t are smaller than those where the sum is t + 1; so, for example, ⟨2, 0, 0⟩ is smaller than ⟨1, 1, 1⟩. If the sum of the quanta is the same, the tuples are lexicographically ordered; so, for example, ⟨1, 0, 2⟩ is smaller than ⟨1, 1, 1⟩. The ordered list of the first few elements of $Q_3$ is shown in Figure 6.9.

In this way, if we want to communicate integer I we will use the k-tuple Q whose rank (starting from 0) in this total order is I. So, for example, in $Q_3$, the triple ⟨1, 0, 3⟩ has rank 25, and the triple ⟨0, 1, 4⟩ corresponds to integer 36. The solution protocol, which we will call $Order_k$, thus uses the following encoding and decoding schemes.

Protocol $Order_k$

Encoding Scheme: Given I, the sender
(E1) finds $Q_k[I] = \langle a_1, a_2, \ldots, a_k \rangle$;
(E2) sets encoding(I) := $b_0 : a_1 : b_1 : \ldots : a_k : b_k$, where the $b_i$ are bits of arbitrary value.

Decoding Scheme: Given ($b_0 : a_1 : b_1 : \ldots : a_k : b_k$), the receiver
(D1) extracts $Q = \langle a_1, a_2, \ldots, a_k \rangle$;
(D2) finds I such that $Q_k[I] = Q$;
(D3) sets decoding($b_0 : a_1 : b_1 : \ldots : a_k : b_k$) := I.

The correctness of the protocol derives from the fact that the mapping we are using is a bijection. Let us examine the cost of protocol $Order_k$. The number of bits is clearly k + 1:

$$B[Order_k](I) = k + 1. \tag{6.7}$$
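Steps (E1) and (D2) amount to ranking and unranking tuples in the total order. The sketch below does this by plain enumeration, which is simple to verify against Figure 6.9 but is not meant to be efficient; all function names are illustrative.

```python
from itertools import count, islice
from math import comb


def compositions(total, parts):
    """Yield the tuples of `parts` nonnegative integers summing to `total`,
    in lexicographic order."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def tuples_in_order(k):
    """Yield all k-tuples of quanta: smaller sums first, then lexicographically."""
    for total in count():
        yield from compositions(total, k)


def unrank(I, k):
    """Step (E1): the k-tuple of rank I (ranks start from 0)."""
    return next(islice(tuples_in_order(k), I, None))


def rank(Q):
    """Step (D2): the rank of the tuple Q."""
    Q = tuple(Q)
    k, total = len(Q), sum(Q)
    below = comb(total + k - 1, k)   # number of k-tuples with a smaller sum
    return below + list(compositions(total, k)).index(Q)


print(unrank(25, 3), rank((0, 1, 4)))
```

The two values printed correspond to the examples in the text: rank 25 gives ⟨1, 0, 3⟩, and ⟨0, 1, 4⟩ has rank 36.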

What is the time? The communication sequence $b_0 : q_1 : b_1 : q_2 : b_2 : \ldots : q_k : b_k$ costs k + 1 time units spent to transmit the bits $b_0, \ldots, b_k$, plus $\sum_{i=1}^{k} q_i$ time units of silence. Hence, to determine the time $T[Order_k](I)$ we need to know the sum of the quanta in $Q_k[I]$. Let f(I, k) be the smallest integer t such that $I \le \binom{t+k}{k}$. Then (Exercise 6.6.12),

$$T[Order_k](I) = f(I, k) + k + 1. \tag{6.8}$$

Optimality

We are now going to show that protocol $Order_k$ is optimal in the worst case. We will do so by establishing a lower bound on the amount of time required to solve the two-party communication problem using exactly k + 1 bit transmissions. Observe that k + 1 time units will be required by any solution algorithm to transmit the k + 1 bits; hence, the concern is on the amount of additional time required by the protocol.

We will establish the lower bound assuming that the values I we want to transmit are from a finite set U of integers. This assumption makes the lower bound stronger because for infinite sets, the bounds can only be worse. Without any loss of generality, we can assume that $U = Z_w = \{0, 1, \ldots, w - 1\}$, where |U| = w.

Let c(w, k) denote the number of additional time units needed in the worst case to solve the two-party communication problem for $Z_w$ with k + 1 bits that can be corrupted during the communication. To derive a bound on c(w, k), we will consider the dual problem of determining the size ω(t, k) of the largest set for which the two-party communication problem can always be solved using k + 1 corruptible transmissions and at most t additional time units. Notice that with k + 1 bit transmissions, it is only possible to distinguish k quanta; hence, the dual problem can be rephrased as follows:

Determine the largest positive integer w = ω(t, k) such that every $x \in Z_w$ can be communicated using k distinguished quanta whose total sum is at most t.

This problem has an exact solution (Exercise 6.6.14):

$$\omega(t, k) = \binom{t+k}{k}. \tag{6.9}$$
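Equation 6.9 counts the k-tuples of nonnegative quanta whose sum is at most t, so it can be cross-checked by brute force for small parameters. This is only an illustrative sketch; the function names are not from the text.

```python
from itertools import product
from math import comb


def omega(t: int, k: int) -> int:
    """Closed form of Equation 6.9."""
    return comb(t + k, k)


def omega_brute(t: int, k: int) -> int:
    """Count the k-tuples of nonnegative integers whose sum is at most t."""
    return sum(1 for q in product(range(t + 1), repeat=k) if sum(q) <= t)


print(omega(4, 3), omega_brute(4, 3))
```

For t = 4 and k = 3 both functions give 35, the number of triples listed in Figure 6.9.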

This means that if U has size ω(t, k), then t additional time units are needed (in the worst case) by any communicator that uses k + 1 unreliable bits to communicate values of U. If the size of U is not precisely ω(t, k), we can still determine a bound. Let f(w, k) be the smallest integer t such that ω(t, k) ≥ w. Then

$$c(w, k) = f(w, k). \tag{6.10}$$

That is,

Theorem 6.2.1 Any corruption-tolerant solution protocol using k + 1 bits to communicate values from $Z_w$ requires f(w, k) + k + 1 time units in the worst case.

In conjunction with Equation 6.8, this means that protocol $Order_k$ is worst-case optimal. We can actually establish a lower bound on the average case as well (Exercise 6.6.15), and prove (Exercise 6.6.16) that protocol $Order_k$ is average-case optimal.

Corruption-Free Communication (⋆)

If bit transmissions are error free, the value of a received packet can be trusted. Hence it can be used to convey information about the value I the sender wants to communicate to the receiver. In this case, the entire communication sequence, bits and quanta, is meaningful.

What we do is something similar to what we just did in the case of corruptible bits. We establish a total order on the set $W_k$ of the (2k + 1)-tuples $\langle b_0, q_1, b_1, q_2, b_2, \ldots, q_k, b_k \rangle$ corresponding to all the possible communication sequences. In this way, each (2k + 1)-tuple $W[i] \in W_k$ has associated a distinct integer: its rank i. Then, if we want to communicate I, we will use the communication sequence described by W[I].

In the total order we choose, all the tuples where the sum of the quanta is t are smaller than those where the sum is t + 1; so, for example, in $W_2$, ⟨1, 2, 1, 0, 1⟩ is smaller than ⟨0, 0, 0, 3, 0⟩. If the sum of the quanta is the same, tuples (bits and quanta) are lexicographically ordered; so, for example, in $W_2$, ⟨1, 1, 1, 1, 1⟩ is smaller than ⟨1, 2, 0, 0, 0⟩.

The resulting protocol is called $Order^{+}_k$. Let us examine its costs. The number of bits is clearly k + 1. Let g(I, k) be the smallest integer t such that $I \le 2^{k+1} \binom{t+k}{k}$. Then (Exercise 6.6.13),

$$B[Order^{+}_k](I) = k + 1 \tag{6.11}$$

$$T[Order^{+}_k](I) = g(I, k) + k + 1. \tag{6.12}$$
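The total order on $W_k$ can be explored by brute force in the same spirit. The sketch below assumes single-bit packets and represents a communication sequence as the tuple (b0, q1, b1, ..., qk, bk); all function names are illustrative, and the enumeration is meant for small parameters only.

```python
from itertools import product
from math import comb


def compositions(total, parts):
    """Yield the tuples of `parts` nonnegative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def sequences_with_sum(t, k):
    """All (2k+1)-tuples whose quanta sum to t, in lexicographic order."""
    block = []
    for quanta in compositions(t, k):
        for bits in product((0, 1), repeat=k + 1):
            seq = [bits[0]]
            for q, b in zip(quanta, bits[1:]):
                seq += [q, b]              # interleave quanta and bits
            block.append(tuple(seq))
    return sorted(block)


def order_key(seq):
    """Sort key of the total order: sum of the quanta first, then the tuple."""
    return (sum(seq[1::2]), seq)


def unrank_plus(I, k):
    """The communication sequence of rank I (ranks start from 0)."""
    t = 0
    while True:
        block = sequences_with_sum(t, k)
        if I < len(block):
            return block[I]
        I -= len(block)
        t += 1


def predicted_count(t, k):
    """Number of sequences with quanta summing to at most t."""
    return 2 ** (k + 1) * comb(t + k, k)


print(unrank_plus(7, 2), predicted_count(1, 2))
```

The count of sequences with total silence at most t is the quantity that appears in the definition of g(I, k).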

Also, protocol $Order^{+}_k$ is worst-case and average-case optimal (see Exercises 6.6.17, 6.6.18, and 6.6.19).

Other Communicators

The protocols $Order_k$ and $Order^{+}_k$ belong to the class of (k + 1)-bit communicators where the number of transmitted bits is fixed a priori and known to both the entities. In this section, we consider arbitrary communicators, where the number of bits used in the transmission might not be predetermined (e.g., it may change depending on the value I being transmitted).

With arbitrary communicators, the basic problem is obviously how the receiver can decide when a communication has ended. This can be achieved in many different ways, and several mechanisms are possible. Following are two classical ones:

Bit Pattern. The sender uses a special pattern of bits to notify the end of communication. For example, the sender sets all bits to 0, except the last, which is set to 1; the drawback with this approach is that the bits cannot be used to convey information about I.

Size Communication. As part of the communication, the sender communicates the total number of bits it will use. For example, the sender uses the first quantum to communicate the number of bits it will use in this communication; the drawback of this approach is that the first quantum cannot be used to convey information about I.

We now show that, however ingenious the employed mechanism may be, the results are not much better than those obtained just using optimal (k + 1)-bit communicators. In fact, an arbitrary communicator can only improve the worst-case complexity by an additive constant. This is true even if the receiver has access to an oracle revealing (at no cost) for each transmission the number of bits the sender will use in that transmission.

Consider first the case of corruptible transmissions. Let γ(t, b) denote the size of the largest set for which an oracle-based communicator uses at most b corruptible bits and at most t + b time units.

Theorem 6.2.2 γ(t, b) < ω(t + 1, b)

Proof. As up to k + 1 corruptible bits can be transmitted, by Equation 6.9,

\[ \gamma(t,b) = \sum_{j=1}^{k} \omega(t,j) = \sum_{j=1}^{k} \binom{t+j}{j} = \binom{t+k+1}{k} - 1 < \binom{t+1+k}{k} = \omega(t+1,b). \] ∎

This implies that, in the worst case, communicator Orderk requires at most one time unit more than any strategy of any type that uses the same maximum number of corruptible bits.

Consider now the case of incorruptible transmissions. Let α(t, b) denote the size of the largest set for which an oracle-based communicator uses at most b reliable bits and at most t + b time units. To determine a bound on α(t, b), we will first consider the size β(t, k) of the largest set for which a communicator without an oracle always uses at most b reliable bits and at most t + b time units. We know (Exercise 6.6.17) that

Lemma 6.2.1 \[ \beta(t,k) = \binom{t+k}{k}\, 2^{k+1}. \]

From this, we can now derive

Theorem 6.2.3 α(t, b) < β(t + 1, b).

COMMUNICATORS, PIPELINE, AND TRANSFORMERS


Proof. As up to k + 1 incorruptible bits can be transmitted,

\[ \alpha(t,b) = \sum_{j=1}^{k} \beta(t,j) = \sum_{j=1}^{k} \binom{t+j}{j}\, 2^{j+1} < \binom{t+1+k}{k}\, 2^{k+1} = \beta(t+1,b). \]

[...]

I2, x2 will finish waiting its value before this message arrives. In this case, x2 will wait until it receives the “Stop-Counting” signal from x1, and then forward it. Thus, the “Stop-Counting” signal will be sent to x3 at the correct time t + 1 + I1 = t + 1 + Max{I1, I2} = t . That is, x2 will always send Max{I1, I2} in time to x3. The same reasoning we just used to understand how x2 can know Max{I1, I2} in time can be applied to verify that indeed each xj can know Max{I1, I2, ..., Ij−1} in time (Exercise 6.6.23). An example is shown in Figure 6.12.

We have described the solution using TwoBits as the communicator. Clearly any communicator C can be used, provided that its encoding is monotonically increasing,

FIGURE 6.12: Time–Event diagram showing the computation of the largest value in pipeline.


that is, if I > J, then in C the communication sequence for I is lexicographically smaller than that for J. Note that protocols Orderk and Order+k are not monotonically increasing; however, it is not difficult to redefine them so that they have such a property (Exercises 6.6.21 and 6.6.22). The total number of bits will then be

(p − 1) Bits(C, Imax),    (6.15)

the same as that without pipeline. The time instead is at most

(p − 1) + Time(C, Imax).    (6.16)

Once again, the number of bits is the same as that without pipeline; the time costs are instead greatly reduced: The factor (p − 1) is additive and not multiplicative. Similar reductions in time can be obtained for other computations, such as computing the minimum value (Exercise 6.6.24), the sum of the values (Exercise 6.6.25), and so forth. The approach we used for these computations in a chain can be generalized to arbitrary tree networks; see for example Problems 6.6.5 and 6.6.6.

6.2.3 Transformers

Asynchronous-to-Synchronous Transformation

The task of designing a fully synchronous solution for a problem can be easily accomplished if there is already a known asynchronous solution A for that problem. In fact, since A makes no assumptions on time, it will run under every timing condition, including the fully synchronous ones. Its cost in such a setting would be the number of messages M(A) and the “ideal” time T(A). Note that this presupposes that the size m(A) of the messages used by A is not greater than the packet size c (otherwise, the message must be broken into several packets, with a corresponding increase in message and time complexity).

We can actually exploit the availability of an asynchronous solution protocol A in a more clever way and with a more efficient performance than just running A in the fully synchronous system. In fact, it is possible to transform any asynchronous protocol A into an efficient synchronous one S, and this transformation can be done automatically. This is achieved by an asynchronous-to-synchronous transformer (or just transformer), a “compiler” that, given as input an asynchronous protocol solving a problem P, generates an efficient synchronous protocol solving P.

The essential component of a transformer is the communicator. Let C be a universal communicator (i.e., a communicator that works for all positive integers). An asynchronous-to-synchronous transformer τ[C] is obtained as follows.

Transformer τ[C]
Given any asynchronous protocol A, replace the asynchronous transmission-reception of each message in A by the communication, using C, of the information contained in that message.
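To make the role of the communicator concrete, the following is a minimal Python sketch of a TwoBits-style communicator, the one used as an example in this section. It assumes only the costs recalled in the text (2 bits and I + 2 time units to communicate an integer I); the timeline representation and the function names `twobits_send` and `twobits_receive` are illustrative assumptions, not the book's notation.

```python
def twobits_send(value):
    """Return the tick-by-tick activity of the sender for one communication.

    1 means "a bit is transmitted in this clock tick", 0 means silence.
    The value is encoded by the amount of silence between the two bits.
    """
    return [1] + [0] * value + [1]


def twobits_receive(timeline):
    """Recover the value by counting the silent ticks between the two bits."""
    first = timeline.index(1)
    second = timeline.index(1, first + 1)
    return second - first - 1


if __name__ == "__main__":
    timeline = twobits_send(5)
    print("ticks used:", len(timeline))        # value + 2 clock ticks
    print("bits used:", sum(timeline))         # always 2 bits
    print("decoded:", twobits_receive(timeline))
```

A transformer built on this sketch would call `twobits_send` wherever the asynchronous protocol sends a message, which is why the time of the resulting protocol grows with the communicated value while the number of bits per message stays constant.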


In other words, we replace each “send message” instruction in algorithm A by an instruction “communicate content of message,” where the communication is performed using the communicator C. It is not difficult to verify that if A solves problem P for a class G of system topologies (i.e., graphs), then τ[C](A) = S is a fully synchronous protocol that solves P for the graphs in G. Note that in a practical implementation, we must take care of several details (e.g., overlapping arrival of messages) that we are not discussing here.

Let us calculate now the cost of the obtained protocol S = τ[C](A) in a graph G ∈ G; let M(A), Tcausal(A), and m(A) denote the message complexity, the causal time complexity, and the size of the largest message, respectively, of A in G. Recall that the causal time complexity is the length of the longest chain of causally related message transmissions over all possible executions. For some protocols, it might be difficult to determine the causal time; however, we know that Tcausal(A) ≤ M(A); hence we always have an upperbound.

In the transformation, the transmission (and corresponding reception) of I in A is replaced by the communication of I using communicator C; this communication requires Time(C, I) time and Packets(C, I) packets. As at most Tcausal(A) messages must be sent sequentially (i.e., one after the other) and I ≤ 2^{m(A)}, the total number of clock ticks required by S will be

Time(S) ≤ Tcausal(A) × Time(C, 2^{m(A)}).    (6.17)

As the information of each of the M(A) messages must be communicated, and the messages have size at most m(A), the total number of packets P(S) transmitted by the synchronous protocol S is just

P(S) ≤ M(A) × Packets(C, m(A)).    (6.18)

In other words,

Lemma 6.2.2 Transformation Lemma
For every universal communicator C there exists an asynchronous-to-synchronous transformer τ[C]. Furthermore, for every asynchronous protocol A, the packet-time cost of τ[C](A) is at most

Cost[τ[C](A)] ≤ ⟨ M(A) Packets(C, m(A)), Tcausal(A) Time(C, 2^{m(A)}) ⟩.

This simple transformation mechanism might appear to yield inefficient solutions for the synchronous case. To dispel this false appearance, we will consider an interesting application.

Application: Election in a Synchronous Ring

Consider the problem of electing a leader in a synchronous ring. We assume the standard restrictions for elections (IR), as well as Synch. We have seen several efficient election algorithms for asynchronous ring networks in previous chapters. Let us choose one and examine the effects of the transformer.


Consider protocol Stages. Recall that this protocol uses M(Stages) = 2n log n + O(n) messages; each message contains a value; hence, m(Stages) = log i, where i is the range of the input values; regarding the causal time, as Tcausal(A) ≤ M(A) for every protocol A, we have Tcausal(Stages) ≤ 2n log n + O(n).

To apply the Transformation Lemma, we need to choose a universal communicator. Let us choose a not very efficient one: TwoBits; recall that the cost of communicating integer I is 2 bits and I + 2 time units.

Let us now apply the Transformation Lemma. We then have a new election protocol SynchStages = τ[TwoBits](Stages) for synchronous rings; as Time(TwoBits, 2^{m(Stages)}) = 2^{log i} + 2 = i + 2, by Lemma 6.2.2, we have

T(SynchStages) ≤ 2n log(n) (i + 2) + l.o.t.    (6.19)

and

B(SynchStages) = 2 M(Stages) ≤ 4n log(n) + O(n).    (6.20)

This result must be compared with the bounds of the election algorithm Speed specifically designed for synchronous systems (see Figure 6.13): The transformation lemma yields bounds that are orders of magnitude better than those previously obtained by the specifically designed algorithm.

Once we have obtained a solution protocol using a transformer, both the bits and the time complexity of this solution depend on the communicator employed by the transformer. Sometimes, the time complexity can be further reduced without increasing the number of bits by using pipeline. For example, during every stage of protocol Stages and thus of protocol SynchStages, the information from each candidate must reach the neighboring candidate on each side. This operation, as we have already seen, can be efficiently done in pipeline, yielding a reduction in time costs (Exercise 6.6.26).

Design Implications

The transformation lemma gives a basis of comparison for designing efficient synchronous solutions to problems for which there already exist asynchronous solutions. To improve on the bounds obtained by the use of the transformation lemma, it is necessary to more explicitly and cleverly exploit the availability of “time” as a computational tool. Some techniques that achieve this goal for some specific problems are described in the next sections.

Protocol       Bits          Time
Speed          O(n log i)    O(2^i n)
SynchStages    O(n log n)    O(i n log n)

FIGURE 6.13: The transformer yields a more efﬁcient ring election protocol
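The SynchStages bounds compared in Figure 6.13 follow mechanically from the Transformation Lemma. The sketch below evaluates only the leading terms, assuming M(Stages) ≈ 2n log n, Tcausal(Stages) ≤ M(Stages), m(Stages) = log i, and the TwoBits costs recalled in the text; lower-order terms are dropped, and all function names are hypothetical.

```python
import math


def twobits_time(value):
    """Clock ticks TwoBits needs to communicate the integer `value`."""
    return value + 2


def twobits_packets(_message_bits):
    """TwoBits always transmits exactly two bits, whatever the message size."""
    return 2


def transformer_cost(messages, causal_time, message_bits, comm_time, comm_packets):
    """Packet and time bounds of the Transformation Lemma (Lemma 6.2.2)."""
    packets = messages * comm_packets(message_bits)
    time = causal_time * comm_time(2 ** message_bits)
    return packets, time


def synch_stages_bounds(n, i):
    """Leading-term bounds for SynchStages on a ring of size n, value range i."""
    message_bits = math.ceil(math.log2(i))
    messages = 2 * n * math.ceil(math.log2(n))   # O(n) term omitted
    causal_time = messages                        # uses Tcausal(A) <= M(A)
    return transformer_cost(messages, causal_time, message_bits,
                            twobits_time, twobits_packets)


if __name__ == "__main__":
    bits, ticks = synch_stages_bounds(n=16, i=256)
    print("bits  <=", bits)
    print("ticks <=", ticks)
```

Swapping `twobits_time` and `twobits_packets` for the cost functions of another universal communicator gives the corresponding bounds for a different transformer.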


When designing a protocol, our aim must be to avoid the transmission of unbounded messages; in particular, if the input values are drawn from some unbounded universe (e.g., positive integers) and the goal of the computation is the evaluation of a function of the input values, then the messages cannot contain such values. For example, the “trick” on which the transformation lemma is based is an instance of a simple and direct way of exploiting time by counting it; in this case, the actual value is communicated but not transmitted.

6.3 MIN-FINDING AND ELECTION: WAITING AND GUESSING

Our main goal as protocol designers is to exploit the fact that in synchronous systems, time is an explicit computational tool, so as to develop efficient solutions for the assigned task or problem. Let us consider again two problems that we have extensively studied for asynchronous networks: minimum-finding and election. We assume the standard restrictions for minimum-finding (R), as well as Synch; in the case of election, we obviously assume Initial Distinct Values (ID) also.

We have already seen a solution protocol, Speed, designed for synchronous ring networks; we have observed how its low message costs came at the expense of a time complexity that is exponential in the range of the input values.

The Transformation Lemma provides a tool that automatically produces a synchronous solution when an asynchronous one is already available. We have seen how the use of a transformer leads to an election protocol for rings, SynchStages, with reduced bits and time costs. By integrating pipeline, we can obtain further improvements.

The cost of minimum-finding and election can be significantly reduced by using other types of “temporal” tools and techniques. In this section, we will describe two basic techniques that make an explicit use of time, waiting and guessing. We will describe and use them to efficiently solve MinFinding and Election in rings and other networks.
6.3.1 Waiting

Waiting is a technique that uses time not to transmit a value (as in the communicators), but to ensure that a desired condition is verified.

Waiting in Rings

Consider a ring network where each entity x has as initial value a positive integer id(x). Let us assume, for the moment, that the ring is unidirectional and that all entities start at the same time (i.e., simultaneous initiation). Let us further assume that the ring size n is known.

The way of finding the minimum value using waiting is surprisingly simple. What an entity x will initially do is nothing, but just wait. More precisely,

Waiting
1. The entity x waits for a certain amount of time f(id(x), n).
2. If nothing happens during this time, the entity determines “I am the smallest” and sends a “Stop” message.
3. If, instead, while waiting the entity receives a “Stop” message, it determines “I am not the smallest” and forwards the message.

With the appropriate choice of the waiting function f, this surprisingly simple protocol works correctly!

To make the process work correctly, the entities with the smallest value must finish waiting before anybody else does (in this way, each of them will correctly determine “I am the minimum”). In other words, the waiting function f must be monotonically increasing: if id(x) < id(y) then f(id(x), n) < f(id(y), n).

This is, however, not sufficient. In fact, it is also necessary that every entity whose value is not the smallest receives a “Stop” message while still waiting (in this way, each of them will correctly determine “I am not the minimum”). To achieve this, it is necessary that if x originates a “Stop” message, this message would reach every entity y with id(x) < id(y) while y is still waiting, that is, if id(x) < id(y), then

f(id(x), n) + d(x, y) < f(id(y), n),    (6.21)

where d(x, y) denotes the distance of y from x in the ring. This must hold regardless of the distance d(x, y) and regardless of how small id(y) is (provided id(y) > id(x)). As d(x, y) ≤ n − 1 for every two entities in the ring, and the smallest value larger than id(x) is clearly id(x) + 1, any function f satisfying the following inequality

f(0) = 0
f(v, n) + n − 1 < f(v + 1, n)    (6.22)

will make protocol Wait function correctly. Such is, for example, the waiting function

f(i, n) = i n.    (6.23)

As an example, consider the ring topology shown in Figure 6.14(a) where n = 6. The entities with the smallest value, 3, will ﬁnish waiting before all others: After 6 × 3 = 18 units of time they send a message along the ring. These messages travel along the ring encountering the other entities while they are still waiting, as shown in Figure 6.14(b). IMPORTANT. Protocol Wait solves the minimum-ﬁnding problem, not the election: Unless we assume initial distinct values, more than one entity might have the same smallest value, and they will all correctly determine that they are the minimum.
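The behavior just described can be reproduced with a small round-based simulation. This is only a sketch under the assumptions of the text (unidirectional ring, simultaneous start, n known, waiting function f(i, n) = i n, one time unit per hop); the function name `simulate_wait` and the sample values other than the minimum 3 are made up for illustration and are not those of Figure 6.14.

```python
def simulate_wait(ids):
    """Simulate protocol Wait on a unidirectional synchronous ring.

    ids[x] is the value of entity x; messages travel from x to x + 1.
    Returns the final state of every entity and the number of messages sent.
    """
    n = len(ids)
    state = ["candidate"] * n
    pending = []                      # (arrival_time, destination) pairs
    messages = 0
    t = 0
    while "candidate" in state or pending:
        # Deliver the "Stop" messages arriving at time t.
        arrivals = [dest for (when, dest) in pending if when == t]
        pending = [(when, dest) for (when, dest) in pending if when != t]
        for dest in arrivals:
            if state[dest] == "candidate":
                state[dest] = "large"               # "I am not the smallest"
                pending.append((t + 1, (dest + 1) % n))
                messages += 1
            # An entity already in state "minimum" absorbs the message.
        # Entities whose waiting time f(id, n) = id * n expires at time t.
        for x in range(n):
            if state[x] == "candidate" and t == ids[x] * n:
                state[x] = "minimum"                # "I am the smallest"
                pending.append((t + 1, (x + 1) % n))
                messages += 1
        t += 1
    return state, messages


if __name__ == "__main__":
    final_state, sent = simulate_wait([3, 8, 5, 3, 6, 4])
    print(final_state)
    print("messages:", sent)
```

With two entities holding the smallest value, both end in state minimum, which illustrates why the protocol solves minimum-finding but not election unless the values are distinct.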


FIGURE 6.14: (a) The time when an entity x would finish waiting; (b) the messages sent by the entities with value 3 at time 6 × 3 = 18 reach the other entities while they are still waiting.

As an example of execution of waiting under the (ID) restriction, consider the ring topology shown in Figure 6.15 where n = 6, and the values outside the nodes indicate how long each entity would wait. The unique entity with the smallest value, 3, will be elected after 6 × 3 = 18 units of time. Its “Stop” message travels along the ring encountering the other entities while they are still waiting.

FIGURE 6.15: Execution with Initial Distinct Values: a leader is elected.

Protocol       Bits          Time            Notes
Speed          O(n log i)    O(2^i n)
SynchStages    O(n log n)    O(i n log n)
Wait           O(n)          O(i n)          n known

FIGURE 6.16: Waiting yields a more efﬁcient ring election protocol

What is the cost of such a protocol? Only an entity that becomes minimum originates a message; this message will only travel along the ring (forwarded by the other entities that become large) until the next minimum entity. Hence the total number of messages is just n; as these messages are signals that do not contain any value, we have that Wait uses only O(n) bits. This is the least amount of transmission possible.

Let us consider the time. It will take f(imin, n) = imin n time units for the entities with the smallest value to decide that they are the minima; at most, n − 1 additional time units are needed to notify all others. Hence, the time is O(i n), where i is the range of the input values. Compared with the other protocols we have seen for election in the ring, Speed and SynchStages, the bit complexity is even better (see Figure 6.16).

Without Simultaneous Initiation

We have derived this surprising result assuming that the entities start simultaneously. If the entities can start at any time, it is possible that an entity with a large value starts so much before the others that it will finish waiting before the others and incorrectly determine that it is the minimum.

This problem can be taken care of by making sure that although the entities do not start at the same time, they will start not too far away (in time) from each other. To achieve this, it is sufficient to perform a wake-up: When an entity spontaneously wants to start the protocol, it will first of all send a “Start” message to its neighbor and then start waiting. An inactive entity will become active upon receiving the “Start” message, forward it, and start its waiting process.

Let t(x) denote the time when entity x becomes awake and starts its waiting process; then, for any two entities x and y,

∀x, y    t(y) − t(x) ≤ d(x, y);    (6.24)

in particular, no two entities will start more than n − 1 clock ticks off from each other.

The waiting function f must now take into account this fact. As before, it is necessary that if id(x) < id(y), then x must finish waiting before y and its message should reach y while still waiting; but now this must happen regardless of at what time t(x) entity x starts and at what time t(y) entity y starts; that is, if id(x) < id(y),

t(x) + f(id(x), n) + d(x, y) < t(y) + f(id(y), n).    (6.25)


As d(x, y) < n for every two entities in the ring, by Equation 6.24, and by setting f(0) = 0, it is easy to verify that any function f satisfying the inequality

f(0) = 0
f(v, n) + 2n − 1 < f(v + 1, n)    (6.26)

will make protocol Wait function correctly even if the entities do not start simultaneously. Such is, for example, the waiting function

f(v, n) = 2 n v.    (6.27)
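The requirement on the waiting function can be checked by brute force for small rings. The sketch below tests condition (6.25) for consecutive values v and v + 1, every distance d(x, y) from 1 to n − 1, and every start-time difference t(x) − t(y) of magnitude at most `max_skew`; the helper name `is_safe` and its parameters are assumptions introduced here for illustration.

```python
def is_safe(f, n, max_value, max_skew):
    """Check t(x) + f(v, n) + d < t(y) + f(v + 1, n) for all small cases.

    `skew` stands for t(x) - t(y); max_skew = 0 models simultaneous start,
    max_skew = n - 1 models the wake-up guarantee of Equation 6.24.
    """
    for v in range(max_value):
        for d in range(1, n):
            for skew in range(-max_skew, max_skew + 1):
                if not (skew + f(v, n) + d < f(v + 1, n)):
                    return False
    return True


def single(v, n):
    """Waiting function (6.23), intended for simultaneous start."""
    return v * n


def double(v, n):
    """Waiting function (6.27), intended for arbitrary start times."""
    return 2 * n * v


if __name__ == "__main__":
    n = 6
    print(is_safe(single, n, 20, max_skew=0))        # simultaneous start
    print(is_safe(single, n, 20, max_skew=n - 1))    # staggered start
    print(is_safe(double, n, 20, max_skew=n - 1))    # staggered start
```

Checking only consecutive values suffices here because both functions are increasing in v, so the gap between f(v, n) and any larger value is at least the gap to f(v + 1, n).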

The cost of the protocol is slightly bigger, but the order of magnitude is the same. In fact, in terms of bits we are performing also a wake-up that, in a unidirectional ring, costs n bits. As for the time, the new waiting function is just twice the old one; hence, the time costs are at most doubled. In other words, the costs are still those indicated in Figure 6.16.

In Bidirectional Rings

We have considered unidirectional rings. If the ring is bidirectional, the protocol requires marginal modifications, as shown in Figure 6.17. The same costs as the unidirectional case can be achieved with the same waiting functions.

On the Waiting Function

We have assumed that the ring size n is known to the entities; it is indeed used in the requirements for waiting functions (Expressions 6.22 and 6.26).

An interesting feature (Exercise 6.6.31) is that those requirements would work even if a quantity n̄ is used instead of n, provided n̄ ≥ n. Hence, it is sufficient that the entities know (the same) upperbound n̄ on the network size.

If the entities all have available a value n̄ that is, however, smaller than n, its use in a waiting function instead of n would in general lead to incorrect results. There is, however, a range of values for n̄ that would still guarantee the desired result (Exercise 6.6.32).

A final interesting observation is the following. Consider the general case when the entities have available neither n nor a common value n̄, that is, each entity only knows its initial value id(x). In this case, if each entity uses in the waiting function its value id(x) instead of n, the function would work in some cases, for example, when all initial values id(x) are not smaller than n. See Exercise 6.6.33.

Universal Waiting Protocol

The waiting technique we have designed for rings is actually much more general and can be applied in any connected network G, regardless of its topology. It is thus a universal protocol. The overall structure is as follows:

1. First a reset is performed with message “Start.”
2. As soon as an entity x is active, it starts waiting f(id(x), n) time units.


PROTOCOL Wait

States: S = {ASLEEP, CANDIDATE, LARGE, MINIMUM};
SINIT = {ASLEEP};
STERM = {LARGE, MINIMUM}.
Restrictions: R ∪ Synch ∪ Ring ∪ Known(n).

ASLEEP
    Spontaneously
    begin
        set alarm := c(x) + f(id(x), n);
        send("Start") to right;
        direction := right;
        become CANDIDATE;
    end

    Receiving("Start")
    begin
        set alarm := c(x) + f(id(x), n);
        send("Start") to other;
        direction := other;
        become CANDIDATE;
    end

CANDIDATE
    When(c(x) = alarm)
    begin
        send("Over") to direction;
        become MINIMUM;
    end

    Receiving("Over")
    begin
        send("Over") to other;
        become LARGE;
    end

FIGURE 6.17: Protocol Wait.

3. If nothing happens while x is waiting, x determines that “I am the minimum” and initiates a reset with message “Stop.”
4. If, instead, a “Stop” message arrives while x is waiting, then it stops its waiting, determines that “I am not the minimum,” and participates in the reset with message “Stop.”

Again, regardless of the initiation times, it is necessary that the entities with the smallest value finish waiting before the entities with larger value and that all those other entities receive a “Stop” message while still waiting. That is, if id(x) < id(y), then

t(x) + f(id(x)) + dG(x, y) < t(y) + f(id(y)),

where dG(x, y) denotes the distance between x and y in G, and t(x) and t(y) are the times when x and y start waiting. Clearly, for all x, y, |t(x) − t(y)| ≤ dG(x, y); hence, setting f(0) = 0, we have that any function satisfying

f(0) = 0
f(v) + 2dG < f(v + 1)    (6.28)

makes the protocol correct, where dG is the diameter of G. This means that, for example, the function

f(v) = 2 v (dG + 1)    (6.29)

would work. As n − 1 ≥ dG for every G, this also means that the function f(v) = 2 v n we had determined for rings actually works in every network; it might not be the most efficient though (Exercises 6.6.29 and 6.6.30).

Applications of Waiting

We will now consider two rather different applications of protocol Wait. The first is to compute two basic Boolean functions, AND and OR; the second is to reduce the time costs of protocol Speed that we discussed earlier in this chapter. In both cases we will consider a unidirectional ring for the discussion; the results, however, trivially generalize to all other networks.

In discussing these applications, we will discover some interesting properties of the waiting function.

Computing AND and OR

Consider the situation where every entity x has a Boolean value b(x) ∈ {0, 1}, and we need to compute the AND of all those values. Assume as before that the size n of the ring is known. The AND of all the values will be 1 if and only if ∀x b(x) = 1, that is, all the values are 1; otherwise the result is 0.

Thus, to compute AND it suffices to know if there is at least one entity x with value b(x) = 0. In other words, we just need to know whether the smallest value is 0 or 1.

With protocol Waiting we can determine the smallest value. Once this is done, the entities with such a value know the result. If the result of AND is 1, all the entities have value 1 and are in state minimum, and thus know the result. If the result of AND is 0, the entities with value 0 are in state minimum (and thus know the result), while the others are in state large (and thus know the result).

Notice that if an entity x has value b(x) = 0, using the waiting function of Expression 6.27, its waiting time will be f(b(x)) = 2 b(x) n = 0. That is, if an entity has value 0, it does not wait at all. To determine the cost of the overall protocol is quite simple (Exercise 6.6.35).

In a similar way we can use protocol Waiting to compute the OR of the input values (Exercise 6.6.36).

Reducing Time Costs of Speed

The first synchronous election protocol we have seen for ring networks is Speed, discussed in Section 6.1.4. (NOTE: to solve the election problem it assumes initial distinct values.) On the basis of the idea of messages traveling along the ring at different speeds, this protocol has unfortunately a terrifying time complexity: exponential in the (a priori unbounded) smallest input value imin (see Figure 6.16). Protocol Waiting has a much better complexity, but it requires knowledge of (an upperbound on) n; on the contrary, protocol Speed requires no such knowledge.

It is possible to reduce the time costs of Speed substantially by adding Waiting as a preliminary phase. As each entity x knows only its value id(x), it will first of all execute Waiting using 2 id(x)^2 as the waiting function.

Depending on the relationship between the values and n, the Waiting protocol might work (Exercise 6.6.33), determining the unique minimum (and hence electing a leader). If it does not work (a situation that can be easily detected; see Exercise 6.6.34), the entities will then use Speed to elect a leader.

The overall cost of this combined protocol Wait + Speed clearly depends on whether the initial Waiting succeeds in electing a leader or not. If Waiting succeeds, we will not execute Speed and the cost will just be O(imin^2) time and O(n) bits.

If Waiting does not succeed, we must also run Speed, which costs O(n) messages but O(n 2^{imin}) time. So the total cost will be O(n) messages and O(imin^2 + n 2^{imin}) = O(n 2^{imin}) time. However, if Waiting does not succeed, it is guaranteed that the smallest initial value is smaller than n, that is, imin < n (see again Exercise 6.6.33). This means that the overall time cost will be only O(n 2^n).

In other words, whether the initial Waiting succeeds or not, protocol Wait + Speed will use O(n) messages. As for the time, it will cost either O(imin^2) or O(n 2^n), depending on whether the waiting succeeds or not.

Summarizing, using Waiting we can reduce the time complexity of Speed from O(n 2^i) to O(Max{i^2, n 2^n}), adding at most O(n) bits.


Application: Randomized Election

If the assumption on the uniqueness of the identities does not hold, the election problem obviously cannot be solved by any minimum-finding process, including Wait. Furthermore, we have already seen that if the nodes have no identities (or, analogously, all have the same identity), then no deterministic solution exists for the election problem, duly renamed symmetry breaking problem, regardless of whether the network is synchronous or not. This impossibility result applies to deterministic protocols, that is, protocols where every action is composed only of deterministic operations.

A different class of protocols are those where an entity can perform operations whose result is random, for example, tossing a die, and where the nature of the action depends on the outcome of this random event. For example, an entity can toss a coin and, depending on whether the result is “head” or “tail,” perform a different operation. These types of protocols will be called randomized; unlike their deterministic counterparts, randomized protocols give no guarantees, either on the correctness of their result or on the termination of their execution. So, for example, some randomized protocols always terminate but the solution is correct only with a given probability; this type of protocol is called Monte Carlo. Other protocols will have the correct solution if they terminate, but they terminate only with a given probability; this type of protocol is called Las Vegas.

We will see how protocol Wait can be used to generate a surprisingly simple and extremely efficient Las Vegas protocol for symmetry breaking. Again we assume that n is known. We will restrict the description to unidirectional rings; the results can, however, be generalized to several other topologies (Exercises 6.6.37-6.6.39).

1. The algorithm is composed of a sequence of rounds.
2. In each round, every entity randomly selects an integer between 0 and b as its identity, where b ≤ n.
3. If the minimum of the chosen values is unique, that entity will become leader; otherwise, a new round is started.

To make the algorithm work, we need to design a mechanism to find the minimum and detect if it is unique. But this is exactly what protocol Wait does. In fact, protocol Wait not only finds the minimum value but also allows an entity x with such a value to detect if it is the only one. In fact,

– If x is the only minimum, its message will come back exactly after n time units; in this case, x will become leader and send a “Terminate” message to notify all other entities.
– If there is more than one minimum, x will receive a message before n time units; it will then send a “Restart” message and start the next round.

In other words, each round is an execution of protocol Wait; thus, it costs O(n) bits, including the “Restart” (or “Terminate”) messages. The time used by protocol Wait is O(n i). In our case the values are integers between 0 and b, that is, i ≤ b. Thus, each round will cost at most O(n b) time.


We have different options with regard to the value b and how the random choice of the identities is made. For example, we can set b = n and choose each value with the same probability (Exercise 6.6.40); notice, however, that the larger b is, the larger the time costs of each round will be. We will use instead b = 1 (i.e., each entity randomly chooses either 0 or 1) and employ a biased coin. Specifically, in our protocol, which we will call Symmetry, we will employ the following criteria:

Random Selection Criteria
In each round, every entity selects 0 with probability 1/n, and 1 with probability (n − 1)/n.

Up to now, except for the random selection criteria, there has been little difference between Symmetry and the deterministic protocols we have seen so far. This is going to change soon.

Let us compute the number of rounds required by the protocol until termination. The surprising thing is that this protocol might never terminate, and thus the number of rounds is potentially infinite. In fact, with a protocol of type Las Vegas, we know that if it terminates, it solves the problem, but it might not terminate. This is not good news for those looking for protocols with a guaranteed performance. The advantage of this protocol is instead in the low expected number of rounds before termination. Let us compute this quantity.

Using the random selection criteria described above, the protocol terminates as soon as exactly one entity chooses 0. For this to happen, one entity x must choose 0 (this happens with probability 1/n), while the other n − 1 entities must choose 1 (this happens with probability ((n − 1)/n)^{n−1}). As there are \binom{n}{1} = n choices for x, the probability that exactly one entity chooses 0 is

\[ \binom{n}{1} \frac{1}{n} \left(\frac{n-1}{n}\right)^{n-1} = \left(\frac{n-1}{n}\right)^{n-1}. \]

For n large enough, this quantity is easily bounded; in fact

\[ \lim_{n \to \infty} \left(\frac{n-1}{n}\right)^{n-1} = \frac{1}{e}, \]    (6.30)

where e ≈ 2.7... is the base of the natural logarithm. This means that each round succeeds with probability approximately 1/e, so the expected number of rounds before protocol Symmetry terminates is about e. In other words, protocol Symmetry is expected to elect a leader with O(n) bits in O(n) time. Obviously, there is no guarantee that a leader will be elected with this cost or will be elected at all, but with high probability it will and at that cost. This shows the unique nature of randomized protocols.
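The probability computed above and the resulting number of rounds can be explored numerically. The following sketch assumes the random selection criteria of protocol Symmetry (each entity picks 0 with probability 1/n) and abstracts each round of protocol Wait into a single test, "did exactly one entity choose 0?"; the function names and the seed are illustrative choices.

```python
import math
import random


def success_probability(n):
    """Probability that exactly one of n entities chooses 0 in a round."""
    return ((n - 1) / n) ** (n - 1)


def rounds_until_leader(n, rng):
    """Run rounds until exactly one entity chooses 0; return the round count."""
    rounds = 0
    while True:
        rounds += 1
        zeros = sum(1 for _ in range(n) if rng.random() < 1 / n)
        if zeros == 1:
            return rounds


if __name__ == "__main__":
    rng = random.Random(2007)          # fixed seed only for repeatability
    n = 50
    trials = [rounds_until_leader(n, rng) for _ in range(10_000)]
    print("per-round success probability:", success_probability(n))
    print("1 / e                        :", 1 / math.e)
    print("average rounds observed      :", sum(trials) / len(trials))
    print("1 / success probability      :", 1 / success_probability(n))
```

Since the rounds are independent, the number of rounds is geometrically distributed, so its expectation is the reciprocal of the per-round success probability.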


6.3.2 Guessing Guessing is a technique that allows some entities to determine a value not by transmitting it but by guessing it. Again we will consider the minimum ﬁnding and election problems in ring networks. Let us assume, for the moment, that the ring is unidirectional and that all entities start at the same time (i.e., simultaneous initiation). Let us further assume that the ring size n is known.

Minimum-Finding as a Guessing Game At the base of the guessing technique there is a basic utility protocol Decide(p), where p is a parameter available to all entities. Informally, protocol Decide(p) is as follows: Decide (p): Every entity x compares its value id(x) with the protocol parameter p. If id(x) ≤ p, x sends a message; otherwise, it will forward any received message. There are only two possible situations and outcomes: S1: All local values are greater than p; in this case, no messages will be transmitted: There will be “silence” in the system. S2: At least one entity x has id(x) ≤ p ; in this case, every entity will send and receive a message: There will be “noise” in the system. The goal of protocol Decide is to make all entities know in which of the two situations we are. Let us examine how an entity y can determine whether we are in situation S1 or S2. If id(y) ≤ p, then y knows immediately that we are in situation S2. However, if id(y) > p, then y does not know whether all the entities have values greater than p (situation S1) or some entities have a value smaller than or equal to p (situation S2). It does know that if we are in situation S2, it will eventually receive a message; by contrast, if we are in situation S1, no message will ever arrive. Clearly, to decide, y must wait; also clearly, it cannot wait forever. How long should y wait? The answer is simple: If a message was sent by an entity, say x, a message will arrive at y within at most d(x, y) < n time units from the time it was sent. Hence, if y does not receive any message in the ﬁrst n time units since the start, then none is coming and we are in situation S1. For this reason, n time units after the entities (simultaneously) start the execution of protocol Decide(p), all the entities can decide which situation (S1 or S2) has occurred. The full protocol is shown in Figure 6.18. IMPORTANT. Consider the execution of Decide(p). 
– If situation S1 occurs, it means that all the values, including imin = Min{id(x)}, are greater than p, that is, p < imin. We will say that p is an underestimate on imin.
– If situation S2 occurs, it means that there are some values that are not greater than p; thus, p ≥ imin. We will say that p is an overestimate on imin.


SUBPROTOCOL Decide(p)

Input: positive integer p;
States: S = {START, DECIDED, UNDECIDED}; SINIT = {START}; STERM = {DECIDED}.
Restrictions: R ∪ Synch ∪ Ring ∪ Known(n) ∪ Simultaneous Start.

START
  Spontaneously
  begin
    set alarm := c(x) + n;
    if id(x) ≤ p then
      decision := high;
      send("High") to right;
      become DECIDED;
    else
      become UNDECIDED;
    endif
  end

UNDECIDED
  Receiving("High")
  begin
    decision := high;
    send("High") to other;
    become DECIDED;
  end

  When(c(x) = alarm)
  begin
    decision := low;
    become DECIDED;
  end

FIGURE 6.18: SubProtocol Decide(p).
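The behaviour of Decide(p) can be illustrated with a small round-based simulation. This is only a sketch under the stated restrictions (unidirectional ring, simultaneous start, known n, unit delays); the function name `decide` and the list-based ring representation are assumptions made for illustration and are not part of the protocol description.

```python
def decide(ids, p):
    """Simulate Decide(p) on a synchronous unidirectional ring.

    ids[i] is the value of the entity at position i, which sends to
    position (i + 1) % n. Returns (decisions, bits, time_units).
    """
    n = len(ids)
    # Entities with id <= p decide "high" at the start and send one bit.
    decided_high = [value <= p for value in ids]
    senders = [i for i in range(n) if decided_high[i]]
    bits = len(senders)
    for _ in range(n):  # the alarm rings n time units after the start
        arrivals = [(i + 1) % n for i in senders]
        senders = []
        for r in arrivals:
            if not decided_high[r]:
                decided_high[r] = True  # an UNDECIDED entity hears "High"
                senders.append(r)       # and forwards it once
                bits += 1
    decisions = ["high" if d else "low" for d in decided_high]
    return decisions, bits, n
```

With values [5, 3, 8, 6], the guess p = 2 produces silence (all entities decide low, no bits), whereas p = 3 makes every entity send exactly one bit and decide high.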

These observations are summarized in Figure 6.19. NOTE. The condition p = imin is also considered an overestimate. Using this fact, we can reformulate the minimum-finding problem in terms of a guessing game: Each entity is a player. The minimum value imin is a number, previously chosen and unknown to the players, that must be guessed. The players can ask questions of the type “Is the number greater than p?”

Situation   Condition   Name              Time   Bits
S1          p < imin    “underestimate”   n      0
S2          p ≥ imin    “overestimate”    n      n

FIGURE 6.19: Results and costs of executing protocol Decide.


Each question corresponds to a simultaneous execution of Decide(p). Situations S1 and S2 correspond to a "YES" and a "NO" answer to the question, respectively. A guessing protocol will just specify which questions should be asked to discover imin.

Initially, all entities choose the same initial guess p1 and simultaneously perform Decide(p1). After n time units, all entities will be aware of whether or not imin is greater than p1 (situation S1 and situation S2, respectively). On the basis of the outcome, a new guess p2 will be chosen by all entities, which will then simultaneously perform Decide(p2). In general, on the basis of the outcome of the execution of Decide(pi), all entities will choose a new guess pi+1. The process is repeated until the minimum value imin is unambiguously determined. Depending on which strategy is employed for choosing pi+1 given the outcome of Decide(pi), different minimum-finding algorithms will result from this technique.

Before examining how to best play (and win) the game, let us discuss the costs of asking a question, that is, of executing protocol Decide. Observe that the number of bits transmitted when executing Decide depends on the situation, S1 or S2, we are in. In fact, in situation S1, no messages will be transmitted at all. By contrast, in situation S2, there will be exactly n messages; as the content of these messages is not important, they can just be single bits. Summarizing, if our guess is an overestimate, we will pay n bits; if it is an underestimate, it will cost nothing. As for the time costs, each execution of Decide will cost n time units regardless of whether the guess is an underestimate or an overestimate. This means that we pay n time units for each question; however, we pay n bits only if our guess is an overestimate. See Figure 6.19.

Our goal must, thus, be to discover the number asking few questions (to minimize time), of which as few as possible are overestimates (to minimize transmission costs). As we will see, we will unfortunately have to trade off one cost for the other. We will first consider a simplified version of the game, in which we know an upperbound M on the number to be guessed, that is, we know that imin ∈ [1, M] (see Figure 6.20). We will then see how to easily and efficiently establish such a bound.

Playing the Game We will now investigate how to design a successful strategy for the guessing game.
The number imin to be guessed is known to be in the interval [1, M] (see Figure 6.20). Let us denote by q the number of questions and by k ≤ q the number of overestimates used to solve the game; this will correspond to a minimum-finding protocol that uses qn time and kn bits. As each overestimate costs us n bits, to design an overall strategy that uses only O(n) bits in total (as we did with protocol Waiting), we must use only a constant (i.e., O(1)) number of overestimates; clearly, we want to use as few questions as possible.

FIGURE 6.20: Guessing in an interval.

Let us first solve the problem with k = 1, that is, we want to find the minimum with only one overestimate. As the number itself (i.e., the guess p = imin) is already an overestimate when we find it, k = 1 means that we can never use as a guess a value greater than imin. For this setting, there is only one possible solution strategy, linear search: The guesses will be p1 = 1, p2 = 2, p3 = 3, ... All guesses smaller than imin will be underestimates; when we guess p = imin, we will have our first and only overestimate. See Figure 6.21. The number of questions will be exactly imin; that is, in the worst case, the cost will be

k = 1; q = M.

FIGURE 6.21: Linear search is the only possibility when k = 1.

Let us now allow one more overestimate, that is, k = 2. Several strategies are now possible. A solution is to partition the interval into √M consecutive pieces of size √M. (If M is not a perfect square, the last interval will be smaller than the others.) See Figure 6.22. We will first search sequentially among the points a1 = √M − 1, a2 = 2√M − 2, ..., until we hit an overestimate. At this point we know the interval where imin is. The second overestimate is then spent to find imin inside that interval using sequential search (as in the case k = 1). In the worst case, we have to search all the aj and all of the last interval, that is, in the worst case the cost will be

k = 2; q = 2√M.

Notice that by allowing a single additional overestimate (i.e., using an additional n bits) we have been able to reduce the time costs from linear to sublinear. In other words, the trade-off between bits and time is not linear. It is easy to generalize this approach (Exercise 6.6.43) so as to find imin with a worst-case cost of

k; q = k M^(1/k).

FIGURE 6.22: Dividing the interval when k = 2.
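The two strategies can be compared with a short sketch that counts questions and overestimates. The helper names are hypothetical, and the k = 2 variant below is an assumption made for simplicity: it probes the right end of each block of size ⌈√M⌉ rather than the exact points aj given in the text, which keeps the same 2√M worst case when M is a perfect square.

```python
import math

def is_overestimate(i_min, p):
    """One question of the game: True when the guess p satisfies p >= i_min."""
    return i_min <= p

def linear_search(i_min, start=1):
    """k = 1: guess start, start + 1, ... until the first overestimate.

    Returns (value found, questions asked, overestimates used).
    """
    p, questions = start, 0
    while True:
        questions += 1
        if is_overestimate(i_min, p):
            return p, questions, 1
        p += 1

def two_level_search(i_min, M):
    """k = 2: locate the block containing i_min, then search it linearly."""
    if not 1 <= i_min <= M:
        raise ValueError("i_min must lie in [1, M]")
    block = math.isqrt(M - 1) + 1      # ceil(sqrt(M)) for M >= 1
    lo, questions = 1, 0
    while True:
        end = min(lo + block - 1, M)   # right end of the current block
        questions += 1
        if is_overestimate(i_min, end):
            break                      # first overestimate: block found
        lo = end + 1
    value, inner_questions, _ = linear_search(i_min, lo)
    return value, questions + inner_questions, 2
```

For M = 100 the worst case is imin = 100: linear search asks 100 questions with one overestimate, while the two-level search asks 20 questions with two overestimates.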


IMPORTANT. Notice that the cost is a trade-off between questions and overestimates: The more overestimates we allow, the fewer questions we need to ask. Furthermore, the trade-off is nonlinear: The reduction in the number of questions achieved by adding a single overestimate is rather dramatic. As every overestimate costs n bits, the total number of bits is O(n k). The total amount of time consumed with this approach is at most O(n k M^(1/k)).

The Optimal Solution We have just seen a solution strategy for our guessing game when the value to be guessed is in a known interval. How good is this strategy? In the case k = 1, there is only one possible solution strategy. However, for k > 1 several strategies and solutions are possible. Thus, as usual, to answer the above question we will establish a lower bound. Surprisingly, in this process, we will also find the (one and only) optimal solution strategy.

To establish a lower bound (and find out if a solution is good) we need to answer the following question:

Q1: What is the smallest number of questions q needed to always win the game in an interval of size M using no more than k overestimates?

Instead of answering this question directly, we will “flip its arguments” and formulate another question:

Q2: With q questions of which at most k are overestimates, what is the largest M so that we can always win the game in an interval of that size?

We will answer this one. The answer will obviously depend on both q and k, that is, M will be some function h(q, k). Let us determine this function. Some things we already know. For example, if we allow only one overestimate (i.e., k = 1), the only solution strategy is linear search, that is,

h(q, 1) = q.    (6.31)

On the contrary, if we allow every question to be an overestimate (i.e., k = q), then we can always win in a much larger interval, in fact (Exercise 6.6.44),

h(q, q) = 2^q − 1.    (6.32)

Before we proceed, let us summarize the problem we are facing: 1. We have at our disposal q questions of which only k can be overestimates. 2. We must always win. 3. We want to know the size h(q, k) of the largest interval in which this is possible.


FIGURE 6.23: If the initial guess p is an underestimate, the largest interval has size p + h(q − 1, k).

Whatever the strategy be, it must start with a question. Let p be this ﬁrst guess. There are two possibilities; this is either an underestimate or an overestimate. If p is an underestimate (i.e., imin > p), we are left with q − 1 questions, but we still have k overestimates at our disposal. Now, the largest interval in which we can always win with q − 1 questions of which k can be overestimates is h(q − 1, k). This means that if p is the ﬁrst question (Figure 6.23), the largest interval has size h(q, k) = p + h(q − 1, k). On the basis of this, it would seem that to make the interval as large as possible, we should choose our ﬁrst guess p to be as large as possible. However, we must take into account the possibility that our ﬁrst guess turns out to be an overestimate. If p is an overestimate, we have spent both one question and one overestimate; furthermore, we know that the number is in the interval [1, p]. This means that the initial guess p we make must guarantee that we always win in the interval [1, p] with q − 1 questions and k − 1 overestimates. Thus, the largest p can be p = h(q − 1, k − 1). This means that h(q, k) = h(q − 1, k) + h(q − 1, k − 1),

(6.33)

where the boundary conditions are those of expressions 6.31 and 6.32; see Figure 6.24. Solving this recurrence relation (Exercise 6.6.45), we obtain the unique solution

$$h(q, k) = \sum_{j=0}^{k-1} \binom{q}{j}. \qquad (6.34)$$

FIGURE 6.24: The initial guess p could be an overestimate; this cannot be larger than h(q − 1, k − 1).


We have found the answer to question Q2. If we now “flip the answer,” we can also answer question Q1 and determine a lower bound on q given M and k. In fact, if M = h(q, k), then the minimum number of questions to always win in [1, M] with at most k overestimates (our original problem) is precisely q. In general, the answer is the smallest q such that M ≤ h(q, k).

IMPORTANT. In the process of finding a lower bound, we have actually found the (one and only) optimal solution strategy to guess in the interval [1, M] with at most k overestimates. Let us examine this strategy.

Optimal Search Strategy To optimally search in [1, M] with at most k overestimates:

1. use as a guess p = h(q − 1, k − 1), where q ≥ k is the smallest integer such that M ≤ h(q, k);
2. if p is an underestimate, then optimally search in [p + 1, M] with k overestimates;
3. if it is an overestimate, then optimally search in [1, p] with k − 1 overestimates.

This strategy is guaranteed to use the fewest questions.

Unbounded Interval We have found the optimal solution strategy using at most k overestimates but assuming that the interval in which imin lies is known. If this is not the case, we can always first establish an upperbound on imin, thus determining an interval, and then search in that interval. To bound the value imin, again we use guesses, g(1), g(2), g(3), ..., where g : N → Z is a monotonically increasing function. The first time we hit an overestimate, say with g(t), we know that g(t − 1) < imin ≤ g(t) and hence the interval to search is [g(t − 1) + 1, g(t)]. See Figure 6.25. This process requires exactly t questions and one overestimate. We are now left to guess imin in an interval of size M = Δ(t) = g(t) − g(t − 1) + 1 with k − 1 overestimates. (Recall, we just spent one to determine the interval.) Using the optimal solution strategy, this can be done with h(Δ(t), k − 1) questions. The entire process will thus require at most t + h(Δ(t), k − 1) questions of which at most k are overestimates.

FIGURE 6.25: In an unbounded interval, we ﬁrst establish an upper bound on imin .
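The bounding phase can be sketched in a few lines for the doubling function g(j) = 2^j. The function name and the treatment of the first guess (taking 1 as the lower end when the very first guess is already an overestimate) are assumptions made for illustration.

```python
def bound_by_doubling(i_min):
    """Ask g(1), g(2), ... with g(j) = 2**j until the first overestimate.

    Returns (t, low, high): t questions were asked, exactly one of them an
    overestimate, and i_min lies in [low, high] = [g(t - 1) + 1, g(t)].
    """
    t = 1
    while i_min > 2 ** t:      # g(t) is still an underestimate
        t += 1
    low = 2 ** (t - 1) + 1 if t > 1 else 1
    return t, low, 2 ** t
```

For example, with imin = 37 the guesses 2, 4, 8, 16, 32 are underestimates and 64 is the first overestimate, so six questions leave the interval [33, 64] to be searched with the remaining k − 1 overestimates.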

Protocol      Bits         Time              Notes
Speed         O(n log i)   O(2^i n)
SynchStages   O(n log n)   O(i n log n)
Wait          O(n)         O(i n)            n known
Guess         O(kn)        O(i^(1/k) k n)    n known

FIGURE 6.26: Using k = O(1), Guessing is more efﬁcient than other election protocols.

Depending on which function g we use, we obtain different costs. For example, choosing g(j) = 2^j (i.e., doubling our guess at every step), t = ⌈log imin⌉ and Δ(t) < imin. This means that the number of questions used by the entire process is at most ⌈log imin⌉ + h(imin, k − 1). Better performances are possible using different functions g; for example (Exercise 6.6.46), with k overestimates, it is possible to reduce the total number of questions to 2 h(imin, k) − 1. Recall that each question costs n time units and, if it is an overestimate, it also costs n bits. Thus, the complexity of the resulting minimum-finding protocol Guess becomes O(k n) bits and O(k n i^(1/k)) time. This means that for any fixed k, the guessing approach yields an election protocol that is far more efficient than the ones we have considered so far, as shown in Figure 6.26.

Removing the Assumptions

Knowledge of n We have assumed that n is known. This knowledge is used only in procedure Decide, where it is employed as a timeout for those entities that do not know if a message will arrive. Clearly the procedure will work even if a quantity n̄ ≥ n is used instead of n. Hence, it is sufficient that the entities know (the same) upperbound n̄ on the network size.

Network Topology We have described our protocol assuming that the network is a ring. However, the optimal search strategy for the guessing game is independent of the network topology. To be implemented, it requires subprotocol Decide(p), which has been described only for rings. This protocol can be made universal, and can thus work in every network, by simple modifications. In fact (Exercise 6.6.47), it suffices:
1. to transform it into a reset with message “High” started by those entities with id(x) ≤ p; and
2. to use as the timeout an upperbound d̄ on the diameter d of the network.
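The two modifications listed above can be sketched as a synchronous flooding simulation on an arbitrary connected network. This is a minimal illustration, not the solution of the exercise: the function name, the adjacency-dictionary representation, and the rule that an entity forwards "High" to every neighbour except one of those it heard it from are assumptions.

```python
def decide_on_graph(adj, ids, p, d_bar):
    """Flood "High" from every entity with id <= p; decide after d_bar ticks.

    adj maps each node to the list of its neighbours (undirected, connected);
    d_bar is an upper bound on the diameter. Returns (decisions, messages).
    """
    high = {v for v in adj if ids[v] <= p}
    # (sender, receiver) pairs travelling during the current time unit
    in_transit = [(v, u) for v in high for u in adj[v]]
    sent = len(in_transit)
    for _ in range(d_bar):
        first_sender = {}
        for sender, receiver in in_transit:
            if receiver not in high and receiver not in first_sender:
                first_sender[receiver] = sender
        in_transit = []
        for receiver, sender in first_sender.items():
            high.add(receiver)  # reached by the reset: decide "high"
            in_transit.extend((receiver, u) for u in adj[receiver] if u != sender)
        sent += len(in_transit)
    decisions = {v: ("high" if v in high else "low") for v in adj}
    return decisions, sent
```

On a 4-node network with m = 5 links and a single entity whose value is at most p, this sketch transmits 7 one-bit messages, which lies between m and 2m.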


Notice that each question will now cost d̄ time units. The number w of bits transmitted if the guess is an overestimate depends on the situation; it is, however, always bounded as follows: m ≤ w ≤ 2m.

Simultaneous Start We have assumed that all entities start the first execution of Decide simultaneously. This assumption can actually be removed by simply using a wake-up procedure at the very beginning (so as to bound the delays between initiation times) and using a longer delay between successive guesses (Exercise 6.6.48).

6.3.3 Double Wait: Integrating Waiting and Guessing

We have seen two basic techniques, Waiting and Guessing. Their use has led to bit-optimal and time-efficient solutions for the minimum-finding and election problems; we have described them for ring networks, but we have seen that they are indeed universal. Their only drawback is that they require knowledge of n (or of some upperbound on the diameter d). In contrast, both Speed and SynchStages did not require such a priori knowledge. If this knowledge is not available, it can, however, be acquired during the computation. We are now going to see how this can be done using both waiting and guessing.

We will focus solely on the election problem; thus, we will be operating under the restriction of initial distinct values. Once again, we will restrict the description to unidirectional ring networks. We also assume that all entities start within n − 1 time units from each other (e.g., they first execute a wake-up). What we are going to do is to still use the waiting technique to find the smallest value; as we do not know n (nor an upperbound on it), we are going to use the guessing strategy to discover an upperbound on n. Let us discuss it in some detail.

Overall Strategy Each entity is going to execute protocol Wait using a guess g(1) on n.
We know that if g(1) ≥ n, then protocol Wait works (Exercise 6.6.31), that is, the entity with the smallest value finishes waiting before all other entities, it becomes small, it sends a message, and its message reaches all other entities while they are still waiting. The problem occurs if g(1) < n; in fact, in this case, it is possible that two or more entities with different ids will stop waiting, become small, and send a message. If we are able to detect that g(1) < n, we can then restart with a different, larger guess g(2) > g(1). In general, if g(j − 1) fails (i.e., g(j − 1) < n), we can restart with a larger guess g(j) > g(j − 1); this process will terminate as soon as g(j) ≥ n. Consider now an entity x that in step j finishes waiting, becomes small, and sends a message. If g(j) ≥ n, no other entity sends any message, so, after n time units, x receives its own message. By contrast, if g(j) < n, several entities might become small and originate messages, each traveling along the ring until it reaches
a small entity; hence x would receive the message transmitted by some other entity. Summarizing, in the ﬁrst case, x receives its own message; in the second case, the message was originated by somebody else. Without knowing n, how can x know whether the received message is its own? Clearly, if each message contains the id of its originator, the problem is trivially solved. However, the number of bits transmitted by just having such a message traveling along the ring will be O(n log i), resulting in an unbounded quantity (see Figure 6.26). The answer is provided by understanding how transmission delays work in a synchronous ring. Consider the delay nx (j ) from the time x transmits its message to the time a message arrives at x. If x receives its own message, then nx (j ) = n. By contrast, if x receives the message of somebody else, this will happen before n time units. That is, nx (j ) < n. So what x needs to do is to verify whether or not nx (j ) = n. This can be done by employing the waiting technique again, using nx (j ) for n in the waiting function. If indeed nx (j ) = n, x will again ﬁnish waiting without receiving any message and send a new message, and this message will travel along the ring after exactly nx (j ) = n time units. If instead nx (j ) < n, as we will see, x will notice that something is wrong (i.e., it will receive a message while waiting, it will receive a message before nx (j ) time units, or it will receive no message nx (j ) time units after it sent one, etc.); in this case, it will start the (j + 1)th iteration. Informally the strategy, called DoubleWait, is as follows: Strategy DoubleWait: 1. Each entity will execute a ﬁrst Wait using the current guess g(j ) on the unknown n. Consider an entity x that ﬁnishes waiting without receiving any message. It will send a message “Wait1,” become testing, and wait for a message to arrive keeping track of the time. Let nx (j ) be the delay from when x sent its “Wait1” message to when x received one. 
If the guess was correct (i.e., g(j) ≥ n > g(j − 1)), then this message would be the one it sent and nx(j) = n.
2. If x notices something wrong (e.g., nx(j) ≤ g(j − 1), or nx(j) > g(j), etc.), it will send a “Restart” message to make everybody restart with a new guess g(j + 1).
3. If x does not notice anything wrong, x will assume that indeed nx(j) = n and will start a second Wait (with a different waiting function) to verify the guess. If the guess is correct, x is the only entity doing so; it should thus finish waiting without receiving any message. Furthermore, the message “Wait2” it sends now should arrive exactly after nx(j) time units.
4. If x now notices something wrong (i.e., a message arrives while waiting; a message does not arrive exactly after nx(j) time units), it will send a “Restart” message to make everybody start with a new guess g(j + 1).


5. Otherwise, x considers the guess verified, becomes the leader, and sends a “Terminate” message.
6. An entity receiving a “Wait1” message while doing the first Waiting will forward received messages and wait for either a “Restart” or a “Terminate.” In the first case it restarts with a new guess; in the second case, it becomes defeated.

What we have to show now is that with the appropriate choice of waiting functions, it is impossible for an entity x to be fooled. That is, if x does not notice anything wrong in the first and in the second waiting and becomes leader, then indeed the message x receives is its own and nobody else will become leader.

Choosing the Waiting Functions What we have to do now is to choose the two waiting functions f and h so that it is impossible for an entity x to be fooled. In other words, it is impossible that the “Wait1” and “Wait2” messages x receives have actually been sent by somebody else, say y and z, and that by pure coincidence both these messages arrived nx(j) time units after x sent its corresponding messages.

IMPORTANT. These functions must satisfy the properties of waiting functions, that is, if g(j) ≥ n, then for all u and v with id(u) < id(v),

f(id(u), j) + 2(n − 1) < f(id(v), j)
h(id(u), j) + 2(n − 1) < h(id(v), j).

NOTE. We can assume that the entities start the current stage using guess g(j) within n − 1 time units from each other; this is enforced in the first stage by the initial wake-up, and in the successive stages by the “Restart” messages.

To determine the waiting functions f and h we need, let us consider the situation in more detail, and let us concentrate on x and see under what conditions it would be fooled. Denote by t(x, j) the delay between the time the first entity starts the jth iteration and the time x starts it. Entity x starts at time t(x, j), waits f(id(x), j) time, and then sends its “Wait1” message; it receives one at time t(x, j) + f(id(x), j) + nx(j).
Notice that to “fool” x, this “Wait1” message must have been sent by some other entity, y. This means that y must also have waited without receiving any message; thus it sent its message at time t(y, j ) + f (id(y), j ). This message arrives at x at time t(y, j ) + f (id(y), j ) + d(y, x),


where, as usual, d(y, x) is the distance from y to x. Hence, for x to be “fooled,” it must be t(x, j ) + f (id(x), j ) + nx (j ) = t(y, j ) + f (id(y), j ) + d(y, x).

(6.35)

Concentrate again on entity x. After it receives the “Wait1” message, x waits again for an additional h(id(x), j) time units, and then it sends its “Wait2” message; it receives one after nx(j) time units, that is, at time

t(x, j) + f(id(x), j) + nx(j) + h(id(x), j) + nx(j) = t(x, j) + f(id(x), j) + h(id(x), j) + 2 nx(j).

At this point it becomes leader and sends a “Terminate” message. If x has been fooled the first time, then message “Wait2” was also sent by some other entity z. It is not difficult to verify that if x has been fooled, then there is only one fooling entity, that is, y = z (Exercise 6.6.49). To have sent a “Wait2” message, y must not have noticed anything wrong (otherwise it would have sent a “Restart” instead). This means that, similarly to x, y received a “Wait1” message ny(j) time units after it sent one, that is, at time t(y, j) + f(id(y), j) + ny(j). It waited for another h(id(y), j) time units and then sent the “Wait2” message; this message thus arrived at x at time t(y, j) + f(id(y), j) + ny(j) + h(id(y), j) + d(y, x). So, if x has been fooled, it must by accident happen that

t(x, j) + f(id(x), j) + h(id(x), j) + 2 nx(j) = t(y, j) + f(id(y), j) + ny(j) + h(id(y), j) + d(y, x).    (6.36)

Subtracting Equation 6.35 from Equation 6.36, we have h(id(x), j ) + nx (j ) = h(id(y), j ) + ny (j ).

(6.37)

Summarizing, x will be fooled if and only if the condition of Equation 6.37 occurs. Notice that this condition does not depend on the ﬁrst waiting function f but only on the second one h. What we have to do is to choose a waiting function h that makes the condition of Equation 6.37 impossible. For example, the function h(id(x), j ) = 2 g(j ) id(x) + g(j ) − nx (j ) is a correct waiting function and will cause Equation 6.37 to become id(x) = id(y).

(6.38)
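The reduction of condition (6.37) to (6.38) can be checked numerically for small values. The sketch below is only an illustration: the guess g(j) = 8, the ranges of ids and delays, and the function names are arbitrary assumptions.

```python
from itertools import product

def h(ident, g_j, n_x):
    """Second waiting function: 2 g(j) id(x) + g(j) - n_x(j)."""
    return 2 * g_j * ident + g_j - n_x

def fooled(id_x, n_x, id_y, n_y, g_j):
    """Condition (6.37): h(id(x), j) + n_x(j) == h(id(y), j) + n_y(j)."""
    return h(id_x, g_j, n_x) + n_x == h(id_y, g_j, n_y) + n_y

g_j = 8  # hypothetical current guess
# Each case is a tuple (id_x, n_x, id_y, n_y).
cases = product(range(1, 6), range(1, g_j + 1), repeat=2)
violations = [c for c in cases if fooled(c[0], c[1], c[2], c[3], g_j) and c[0] != c[2]]
```

Because the term − nx(j) in h cancels the delay added in (6.37), both sides reduce to 2 g(j) id + g(j), so the list of violations stays empty whatever delays are observed.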


As the identities are distinct (because of the ID restriction), this means that x = y, that is, the messages x receives are its own. In other words, with this waiting function, nobody will be fooled. Summarizing, regardless of the waiting function f and of the monotonically increasing guessing function g, with the appropriate choice of the second waiting function h, protocol DoubleWait correctly elects a leader. (Exercises 6.6.50, 6.6.51, and 6.6.52.)

The Cost of DoubleWait Now that we have established the correctness of the protocol, let us examine its costs. The protocol consists of a sequence of iterations. In iteration j, a guess g(j) is made on the unknown ring size n. The terminating condition is simply g(j) ≥ n; in this case, the entity with the smallest value becomes leader; in all other cases, a new iteration is started. The number of iterations ĵ required by the protocol is easily determined. As the protocol terminates as soon as g(j) ≥ n,

ĵ = ⌈g⁻¹(n)⌉,    (6.39)

where g⁻¹ is the inverse of g, that is, ĵ is the smallest positive integer j such that g(j) ≥ n.

In an iteration, the guess g(j) is employed in the execution of a first waiting, using waiting function f(x, j). As a result, either a new iteration is started or a second waiting, using function h(x, j), is executed; as a result of this second waiting, either the algorithm terminates or a new iteration is started, depending on whether or not g(j) ≥ n. The overall cost of the protocol depends on the two waiting functions, f and h, as well as on the monotonically increasing function g : N → Z specifying the guesses. To determine the cost, we will first examine the number of bits and then determine the time. As we will see, we will have many choices available and, again, we will be facing a trade-off between time and bits.

Bits Each iteration consists of at most two executions of the waiting technique (with different waiting functions). Each iteration, except the last, will be aborted and a “Restart” message will signal the start of the next iteration. In other words, each iteration j ≤ ĵ is started by a “Restart” (in the very first one it acts as the wake-up); this costs exactly n signals. As part of the first waiting, “Wait1” messages will be sent, for a total of n signals. In the worst case there will also be a second waiting with “Wait2” messages, causing no more than n signals. Hence, each iteration except the last will cost at most 3n signals. The last iteration also has a “Terminate” message costing exactly n signals. Hence, the total number of bits transmitted by DoubleWait will be at most

B[DoubleWait] = 3 c n ĵ + c n = 3 c n ⌈g⁻¹(n)⌉ + c n,


where c = O(1) is the number of bits necessary to distinguish between the “Restart,” “Wait1,” “Wait2,” and “Terminate” messages.

Time Consider now the time costs of DoubleWait. Obviously, the time complexity of an iteration is directly affected by the values of the waiting functions f and h, which are in turn affected by the value g(j) they must necessarily use in their definition. The overall time complexity is also affected by the number of iterations ĵ = ⌈g⁻¹(n)⌉, which depends on the choice of the function g. Let us first of all choose the waiting functions f and h. The ones we select are

f(id(x), j) = 2 g(j) id(x),    (6.40)

which is the standard waiting function when the entities do not start at the same time and where g(j) is used instead of n; and

h(id(x), j) = 2 g(j) id(x) + g(j) − nx(j),    (6.41)

which is the one that, as we have already seen, makes “fooling” impossible. With these choices made, we can determine the amount of time the protocol uses until termination. In fact, it is immediate to verify (Exercise 6.6.53) that the number of time units until termination is less than

$$T[\mathrm{DoubleWait}] = 2(n - 1) + (4\, i_{min} + 2) \sum_{j=1}^{\hat{j}} g(j).$$

Again, this quantity depends solely on the choice of the guessing function g.
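The two cost expressions can be evaluated for a concrete guessing function with the sketch below. The function names are hypothetical, and c = 2 is an assumption (two bits suffice to tell four message types apart); the text only states that c = O(1).

```python
def iterations(g, n):
    """Smallest positive integer j such that g(j) >= n."""
    j = 1
    while g(j) < n:
        j += 1
    return j

def bits_bound(g, n, c=2):
    """Upper bound 3 c n j_hat + c n on the bits sent by DoubleWait."""
    return 3 * c * n * iterations(g, n) + c * n

def time_bound(g, n, i_min):
    """Upper bound 2(n - 1) + (4 i_min + 2) * (g(1) + ... + g(j_hat))."""
    j_hat = iterations(g, n)
    total_guess = sum(g(j) for j in range(1, j_hat + 1))
    return 2 * (n - 1) + (4 * i_min + 2) * total_guess
```

For instance, with g(j) = 2^j, n = 10, and imin = 3, four iterations are needed (guesses 2, 4, 8, 16), giving a bound of 260 bits and 438 time units.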

Trade-offs: Choosing the Guessing Function The results we have obtained for the number of bits and the amount of time are expressed in terms of the guessing function g. This is the only parameter we have not yet chosen. Before we proceed, let us examine the impact of such a choice.

The protocol terminates as soon as g(j) ≥ n, that is, after ĵ = ⌈g⁻¹(n)⌉ iterations. If we have a fast-growing function g, this will happen rather quickly, requiring few iterations. For example, if we choose g(j) = 2 g(j − 1) (i.e., we double every time), then ĵ = ⌈log n⌉; we could choose something faster, say g(j) = g(j − 1)^2 (i.e., we square every time), obtaining ĵ = ⌈log log n⌉, or g(j) = 2^g(j−1) (i.e., we exponentiate every time), obtaining ĵ = ⌈log* n⌉, where log* denotes the number of times you must take a log before the value becomes 1. So it would seem that to reduce the bit complexity, we need g to grow as fast as possible.

By contrast, the value g(j) is a factor in the time complexity. In particular, the larger g(j) is, the more we have to wait. To understand how bad this impact can be, consider just the very last iteration ĵ and assume that we just missed n, that is, g(ĵ − 1) = n − 1. In this last iteration we wait for roughly 4 id(x) g(ĵ) = 4 id(x) g(⌈g⁻¹(n)⌉) time units.


g(j)                   Bits               Time
g(j) = 2 g(j − 1)      O(n log n)         O(n i)
g(j) = g(j − 1)^2      O(n log log n)     O(n^2 i)
g(j) = 2^g(j−1)        O(n log* n)        O(2^n i)

FIGURE 6.27: Some of the trade-offs offered by the choice of g in DoubleWait.

This does not appear to be too bad; after all, g(g⁻¹(n)) = n. How much bigger than n can g(⌈g⁻¹(n)⌉) be? It depends on how fast g grows. If we choose g(j) = 2 g(j − 1), then g(⌈g⁻¹(n)⌉) = 2(n − 1). However, if we choose g(j) = g(j − 1)^2, then we have g(⌈g⁻¹(n)⌉) = (n − 1)^2, and the choice g(j) = 2^g(j−1) would give us g(⌈g⁻¹(n)⌉) = 2^(n−1). Thus clearly, from the time-complexity point of view, we want a function g that does not grow very fast at all. To help us in the decision process, let us restrict ourselves to a class of functions. A function g is called superincreasing if for all j > 1

$$g(j) \geq \sum_{s=1}^{j-1} g(s). \qquad (6.42)$$
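The effect of the growth rate of g, and the superincreasing property, can be explored with the following sketch. The three guessing functions start from g(1) = 2, which is an assumption, as are all function names.

```python
def doubling(j):
    """g(j) = 2 g(j - 1) with g(1) = 2."""
    return 2 ** j

def squaring(j):
    """g(j) = g(j - 1)^2 with g(1) = 2."""
    return 2 ** (2 ** (j - 1))

def exponentiating(j):
    """g(j) = 2^g(j - 1) with g(1) = 2."""
    value = 2
    for _ in range(j - 1):
        value = 2 ** value
    return value

def last_guess(g, n):
    """Return (j_hat, g(j_hat)) for the first guess that reaches n."""
    j = 1
    while g(j) < n:
        j += 1
    return j, g(j)

def is_superincreasing(g, upto):
    """Check g(j) >= g(1) + ... + g(j - 1) for every 2 <= j <= upto."""
    total = g(1)
    for j in range(2, upto + 1):
        if g(j) < total:
            return False
        total += g(j)
    return True
```

For n = 1000, doubling needs 10 iterations and ends with the guess 1024, whereas squaring and exponentiating need only 5 and 4 iterations but both end with the guess 65536. All three functions are superincreasing, while a slowly growing function such as g(j) = j is not.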

If we restrict ourselves to superincreasing functions, then the bit and time costs of DoubleWait become (Exercise 6.6.54)

B[DoubleWait] ≤ 3 c n ⌈g⁻¹(n)⌉ + c n    (6.43)

T[DoubleWait] ≤ 2(n − 1) + (8 imin + 2) g(⌈g⁻¹(n)⌉).    (6.44)

These bounds show the existence and the nature of the trade-off between time and bits. Some interesting choices are shown in Figure 6.27. Examining the trade-off, we discover two important features of protocol DoubleWait:
1. the bit complexity is always independent of the entities' values and, thus, bounded;
2. the time complexity is always linear in the smallest entity value.

Comparing the cost of DoubleWait with the cost of the other ring election protocols that do not require knowledge of (an upperbound on) n, it is clear that DoubleWait outperforms Speed, which has an unbounded bit complexity and a time complexity exponential in the input values. As for SynchStages, notice that by choosing g(j) = 2 g(j − 1), DoubleWait has the same bit costs but a better time complexity (see Figure 6.28); with a different choice of g, it is possible to have the same time as SynchStages but with a smaller bit complexity (Exercise 6.6.55).

Protocol      Bits            Time              Notes
Speed         O(n log i)      O(n 2^i)
SynchStages   O(n log n)      O(n log n i)
DoubleWait    O(n g⁻¹(n))     O(g(g⁻¹(n)) i)
Wait          O(n)            O(n i)            n known
Guess         O(kn)           O(k n i^(1/k))    n known
Symmetry      O(n)            O(n)              n known; randomized

FIGURE 6.28: Summary of Election techniques for synchronous rings.

Notice that the bit complexity can be asymptotically reduced to $O(n)$, matching that of protocols Wait and Guess, which assume knowledge of an upperbound on $n$; clearly, this is achieved at the expense of an exorbitant time complexity. An exact $O(n)$ bit complexity with a reasonable time can, however, be achieved without knowing $n$ by using DoubleWait in conjunction with other techniques (Problem 6.6.9).

6.4 SYNCHRONIZATION PROBLEMS: RESET, UNISON, AND FIRING SQUAD

A fully synchronous system is by definition highly synchronized, so it might appear strange to talk about the need for synchronization in the system and the computational problems related to it. Regardless of the oddity, the need and the problems exist and are quite important.

There is first of all a synchronization problem related to the local clocks themselves. We know that in a synchronous environment all local clocks tick at the same time; however, they might not show the same value. A synchronous system is said to be in unison if indeed all the clock values are the same. Notice that once a system is in unison, it will remain so unless the values of some clocks are locally altered. The unison problem is how to achieve such a state, possibly with several independent initiators.

Then there are two synchronization problems related to the computational states of the entities. The first we have already seen, the wake-up or reset problem: All entities must enter a special state (e.g., awake); the process can be started by any number of entities independently. Notice that in this specification there is no mention of when an individual entity must enter such a state; in fact, in the solutions we have seen, entities become awake at different times. The second is the firing squad problem: here too all entities must enter a special state (usually called firing), but they must do so at the same time and for the first time.

Firing squad synchronization is obviously stronger than reset. It is also stronger than unison: With unison, all entities arrive at a point where they are operating with the same clock value, and thus, in a sense, they are in the same "state" at the same time; however, the entities do not necessarily know when.


We are going to consider all three problems and examine their nature and interplay in some detail. All of them will be considered under the standard set of restrictions R plus, obviously, Synch.

6.4.1 Reset/Wake-Up

In reset, all entities must enter the same state within finite time. One important application of reset is when a distributed protocol is only initiated by a subset of the entities in the system, and we need all entities in the system to eventually begin executing the protocol. When reset is applied at the first step of a protocol, it is called wake-up. The wake-up or reset problem is a fundamental problem, and we have extensively examined it in asynchronous systems.

In fully synchronous systems it is sometimes also called weak unison; its solution is usually a preliminary step in larger computations (e.g., Wait, Guess, DoubleWait), and it is mostly used to keep the initiation times of the main computation bounded. For example, in protocol Wait applied to a network $G$ (not necessarily a ring) of known diameter $d$, the initial wake-up ensures that all entities become awake within $d$ time units from the start.

For computations that use wake-up as a tool, the cost obviously depends on the cost of the wake-up. Consider, for example, electing a leader in a complete graph $K_n$ using the waiting technique. Not counting the wake-up, the election will cost only $n - 1$ bits, and it can be done in $4\,i_{\min} + 1$ time units (see Equation 6.29); recall that in a complete graph, $d = 1$. The wake-up can also be done fast, in 1 time unit, but this can cost $O(n^2)$ bits. In other words, the dominant bit cost in the entire election protocol is that of the wake-up, and it is unbearably high. Sometimes it is desirable to obtain wake-up protocols that are slower but use fewer transmissions. In the rest of this section we will concentrate on the problem of wake-up in a complete network.
The difficulty of waking up in asynchronous complete networks, which we discussed in Section 2.2, does not disappear in synchronous complete networks. In fact, in complete networks where the port numbers are arbitrary, $\Omega(n^2)$ signals must be sent in the worst case.

Theorem 6.4.1 In a synchronous complete network with arbitrary labeling, wake-up requires $\Omega(n^2)$ messages in the worst case.

To see why this is true, consider any wake-up protocol $W$ that works for any complete network regardless of the labeling. By contradiction, let $W$ use $o(n^2)$ signals in every complete network of size $n$. We will first consider a complete network $K_n^1$ with chordal labeling: A Hamiltonian cycle is identified, and a link $(x, y)$ is labeled at $x$ with the distance from $x$ to $y$ according to that cycle. The links incident on $x$ will, thus, be labeled $1, 2, \ldots, n - 1$. On this network, we will consider the following execution:

E1: Every entity starts the wake-up simultaneously.
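The chordal labeling just described can be sketched in a few lines. The sketch assumes the entities are numbered $0, \ldots, n-1$ along the Hamiltonian cycle (a numbering introduced here only for illustration; the entities themselves need not know it), and the function names are hypothetical. It also checks the symmetry that this labeling induces: a signal sent on port $j$ arrives on port $n - j$.

```python
def chordal_port(x, y, n):
    """Label at x of link (x, y): distance from x to y along the cycle 0, 1, ..., n-1."""
    return (y - x) % n


def ports_at(x, n):
    """All port labels seen by entity x in the complete network on n entities."""
    return sorted(chordal_port(x, y, n) for y in range(n) if y != x)


def arrival_port(x, j, n):
    """Port on which the entity reached from x via port j receives the signal."""
    y = (x + j) % n
    return chordal_port(y, x, n)
```

Every entity sees the same set of labels $1, \ldots, n-1$, and `arrival_port(x, j, n)` equals `n - j` for every `x`, which is why all entities behave identically when they start together.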


Concentrate on an entity $x$; let $L(x)$ be the set of port numbers on which a message was sent or received by $x$ during this execution. Observe that because all entities start at the same time and because of the symmetry of the labeling, $L(x) = L(y)$ for all entities $x$ and $y$. In fact, if $x$ sends a signal via port number $j$, so will everybody else, and all of them will receive it from port number $n - j$. As protocol $W$ is correct, within a finite number $t$ of time units, all the entities terminate. As, by assumption, every execution uses only $o(n^2)$ signals, and each signal involves one port at its sender and one at its receiver while all $n$ sets have the same size, $|L(x)| = l = o(n)$.

We now construct a complete network $K_n^2$ with a different labeling. In this network, we select $l + 1$ entities $x_0, x_1, \ldots, x_l$, and label the links between them with an "almost chordal" labeling using the labels in $L(x)$. All other links in the network are labeled arbitrarily without violating local orientation (this can always be done: Exercises 6.6.57 and 6.6.58). In this network consider the following execution:

E2: Only the selected entities will start and will do so simultaneously.

In this execution only few ($|L(x)| + 1 = o(n)$) entities start. From the point of view of these initiators, everything in this execution happens exactly as if they were in the other execution in the other network: Messages will be sent and received exactly from the same ports in the same steps in both executions. In particular, none of them will send a signal outside its "little clique." Hence, none of the other nodes will receive any signal; as those entities did not wake up spontaneously, this means that none of them will wake up at all. In particular, none of them will send any signal to the initiators; hence no initiator will receive a signal from outside the "little clique." Therefore, the initiators will act as if they are in $K_n^1$ and the execution is E1; thus, at time $t$ the initiators will all terminate the execution of the protocol.
However, the majority of the nodes are not awake, nor will they ever become awake, contradicting the correctness of the protocol. In other words, there is no correct wake-up protocol for complete networks that always uses $o(n^2)$ transmissions.

Summarizing, regardless of the protocol and the techniques (e.g., communicator, pipeline, waiting, guessing), and regardless of the fact that we can use time as a computational tool, wake-up will cost $\Omega(n^2)$ signals in the worst case.

6.4.2 Unison

A synchronous system is said to be in unison if all the clock values are the same. The unison problem is how to achieve such a state, possibly with several independent initiators. Notice that once a system is in unison, it will remain so unless the values of some clocks are locally altered.

Let us examine a very simple protocol for achieving unison. Each entity will execute a sequence of stages, each taking one unit of time, starting either spontaneously or upon receiving a message from another entity.

Protocol MaxWave:
1. An initiator $x$ starts by sending to all its neighbors the value of its local clock $c_x$.


2. A noninitiator $y$ starts upon receiving messages from neighbors: It increases those values by one time unit, computes the largest among these values and its own clock value, resets its clock to such a maximum, and sends it to all its neighbors.
3. In stage $j > 1$, an entity (initiator or not) checks the clock values it receives from its neighbors and increases each one of them by one time unit; it then compares these values with each other as well as with its own. If the value of the local clock is maximum, no message is sent; else, the local clock is set to the largest of all values, and this value is sent to all the neighbors (that sent a smaller value).

Consider the largest value $t_{\max}$ among the local clocks when the protocol starts. It is not difficult to see that this value (increased by one unit at each instant of time) reaches every entity, and every entity will set its local clock to such a time value (Exercise 6.6.59). In other words, with this simple protocol, which we shall call MaxWave, the entities are guaranteed to operate in unison within finite time.

Let us discuss how long this process takes. Unison happens as soon as every entity whose initial clock value was smaller than $t_{\max}$ receives $t_{\max}$ (properly incremented). In the worst case, only one entity $z$ has $t_{\max}$ at the beginning, and this entity is the last one to start. This value (properly incremented) has to reach every other entity in the network; this propagation will require at most a number of time units equal to the diameter $d$ of the network; as $z$ will start at most $d$ time units after the first entity, this means that the system operates in unison after at most $2d$ time units from the start.

How can an entity detect termination? How does it know whether the system is now operating in unison? Necessarily, an entity must know $d$ (or an upperbound on $d$, e.g., $n$) to be able to know when the protocol is over.
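The three rules of MaxWave can be tried out in a small round-based simulation. This is only a sketch under stated assumptions: the network is given as adjacency lists, a message sent in one round is delivered in the next, every local clock ticks once per round, and (as a simplification) an entity that updates its clock sends to all its neighbors rather than only to those that sent a smaller value, so the message count is an upper bound on that of the protocol. The function name and its parameters are hypothetical.

```python
def max_wave(neighbors, clocks, start_times, rounds):
    """Simulate MaxWave; return (first round in which all clocks agree, messages sent).

    neighbors[x]   list of neighbors of entity x
    clocks[x]      initial value of the local clock of x
    start_times    dict mapping each spontaneous initiator to its global start round
    """
    n = len(clocks)
    clocks = list(clocks)
    awake = [False] * n
    inbox = [[] for _ in range(n)]
    messages = 0
    unison_at = None
    for now in range(rounds):
        outbox = [[] for _ in range(n)]
        for x in range(n):
            # Received values are increased by the one time unit spent in transit.
            received = [v + 1 for v in inbox[x]]
            best = max(received, default=clocks[x])
            if not awake[x]:
                if not received and start_times.get(x) != now:
                    continue  # still asleep
                awake[x] = True
                clocks[x] = max(clocks[x], best)
                send = True  # rules 1 and 2: a starting entity always sends
            else:
                send = best > clocks[x]  # rule 3: silent if the own clock is maximum
                clocks[x] = max(clocks[x], best)
            if send:
                for y in neighbors[x]:
                    outbox[y].append(clocks[x])
                    messages += 1
        if unison_at is None and len(set(clocks)) == 1:
            unison_at = now
        clocks = [c + 1 for c in clocks]  # every local clock ticks
        inbox = outbox
    return unison_at, messages


# A path of five entities (diameter d = 4); only entity 0 starts spontaneously,
# while the largest clock value sits at the far end of the path.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
result = max_wave(path, [3, 10, 7, 0, 25], {0: 0}, rounds=12)
```

In this run the wake-up needs $d$ rounds to reach the entity holding the largest value, and that value then needs another $d$ rounds to travel back, so unison is reached exactly $2d = 8$ rounds after the start, the worst case discussed above.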
The bound of $2d$ time units is measured from the (global) time $t$ at which the first entities started the execution of the protocol. An entity $x$ starts participating at some (global) time $t(x) \ge t$. Thus, assuming that (an upperbound on) $d$ is known a priori to all entities, at time $t(x) + 2d$ entity $x$ knows for sure that the system is operating in unison (this time can actually be reduced; see Exercise 6.6.60). In other words, entities may terminate at different times; their termination will, however, be within at most $d$ time units from each other.

What is the number of messages that will be transmitted? A very rough overestimate is easily obtained by assuming that each entity $x$ transmits to all its $|N(x)|$ neighbors in each of the $2d$ time units; since the degrees sum to $2m$, this gives

$$2d \sum_{x} |N(x)| = 2d \cdot 2m = 4\,d\,m.$$

This is a gross overestimate. In fact, once an entity receives the max time, it will transmit only in this step and no more. So the entities with the largest value will transmit to their neighbors only once; their neighbors will transmit only twice; in general, the entities at distance $j$ from the entities with the largest value will transmit only $j + 1$ times. We also know that an entity does not send the max time to those neighbors from which it received it. The actual cost depends on the topology of the network and the actual initiation times. For some networks, the cost is not difficult to determine (Exercises 6.6.61 and 6.6.62). If we are operating not on an arbitrary graph but on a tree (e.g., a previously constructed spanning tree of the network), we immediately have $m = n - 1$, and we can make accurate measurements (Exercise 6.6.63).

In all this discussion, we have made an implicit assumption that the clock values we are sending are bounded and fit inside a message. However, time and thus the clock values are unbounded. In fact, clock values increase at each time unit; in our protocol, the transmitted values were increased at each time unit and the largest was propagated. Therefore, the solution we have described is not feasible as it stands.

To ensure that the values are bounded, we concentrate on the definition of the problem: Our goal is to achieve unison, that is, we want all local clocks to show the same value. Notice that the definition does not care what that value is, but only that it is the same for all entities. Armed with this understanding, we make a very simple modification to the MaxWave protocol: When an entity starts MaxWave, it first resets its local clock to 0. In this way, the maximum value transmitted is at most $2d$ (Exercise 6.6.64), which is bounded.

6.4.3 Firing Squad

Firing squad synchronization is a problem stricter than unison. It requires that all entities enter a predefined special state, firing, for the first time simultaneously. More precisely, all the entities are initially in active state, and each active entity can at any time spontaneously become excited. The goal is to coordinate the entities so that, within finite time from the time the first entity becomes excited, all entities become firing simultaneously and for the first time.
In its original form, the problem was described for synchronous cellular automata (i.e., computational entities with $O(1)$ memory) placed in a line of unknown length $n$, where the leftmost entity in the line is the sole initiator, known as the "general". Note that as cellular automata only have a constant memory size, they cannot represent (nor count up to) nonconstant values such as $n$ or $d$.

We are interested in solving this problem in our setting, where the entities have at least $O(\log n)$ bits of local memory, and thus they can count up to $n$. Again we are looking for a protocol that can work in any network; observe that the entities need to know or to compute (an upperbound on) $d$ to terminate.

If the network is a tree, or we have available a spanning tree of the network, then a simple efficient solution exists, on the basis of saturation (Exercise 6.6.68). This protocol uses at most $3n - 2$ signals and $n - 2$ messages each containing a value of at most $d$, for a total of $O(n \log n)$ bits; the time is at most $3d - 3$. The bit complexity can be reduced to $O(n)$ still using only $O(n)$ time (Exercise 6.6.69). That is, firing squad can be solved in networks with an available spanning tree in optimal time and bits.

What happens if there is no spanning tree available? Even worse, what happens if no spanning tree is constructible (e.g., in an anonymous network)? The problem can still be solved. To do so, let us explore the relationship between firing squad and unison.

First observe that as all entities become firing simultaneously, if each entity resets its local clock when it becomes firing, all local clocks will have the same value 0 at the same time. In other words, any solution to the firing squad problem will also solve the unison problem. The converse is not necessarily true. In unison, all the local clocks will at some point show the same value; however, the entities might not know exactly when this happens. They might become aware (i.e., terminate) at different times; but for firing squad synchronization we need them to make a decision simultaneously, that is, with no difference in time.

Surprisingly, protocol MaxWave actually solves the firing squad problem in networks where no spanning tree is available. To see why this is true, consider the modification we made to ensure that the transmitted values are bounded: When an entity starts the protocol, it first resets its local clock to 0. Let $t$ be the global time when the protocol starts, that is, $t$ is the time when the first entities reset their clocks to 0. We will call such entities "initiators." Two simple observations hold (Exercises 6.6.70 and 6.6.71):

Property 6.4.1
1. If a message originated by an initiator reaches entity $y$ at time $t + w$, then the value of that message (incremented by 1) is exactly $w$.
2. Regardless of whether $y$ has already independently started or starts now, the current value of its local clock will be smaller than $w$; thus, $y$ will set its clock in unison with the clocks of the initiators.
Summarizing, every noninitiator receives a message from the initiators, and as soon as an entity receives a message originated by the initiators (i.e., carrying the max reset time), it comes into unison with the initiators. Thus, an entity $x$ is in unison with the initiators at time $t + d(x, I)$, where $d(x, I)$ denotes the distance between $x$ and the closest initiator. As $d(x, I) \le d$, this means that all clocks will be in unison after at most $d$ time units from the start.

Once the clocks are in unison, unless someone resets them, they keep on being in unison. As nobody is resetting the clocks again, this means that all entities will be in unison at time $t + d$. The value of the clocks at that time is exactly $d$. This means that when the reset local clock shows time $d$, the entity knows that indeed the entire system is in unison; if the entity enters state firing at this time, it is guaranteed that all other entities will do the same simultaneously, and for the first time, solving the firing squad problem.

Summarizing, protocol MaxWave solves the firing squad problem in $d$ time units:

$$T[\text{MaxWave}] = d, \qquad (6.45)$$

and this is worst-case optimal. The number of messages is less than $2\,d\,m$ and each contains at most $\log d$ bits, that is,

$$B[\text{MaxWave}] < 2\,m\,d \log d. \qquad (6.46)$$
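The use of MaxWave as a firing squad protocol can be illustrated with a simulation sketch. It assumes the same round-based model as the protocol description (messages take one time unit, clocks tick once per round), that every entity knows $d$, and, as a simplification, that an entity updating its clock sends to all its neighbors. The function name, its parameters, and the representation of a sleeping entity by `None` are hypothetical choices made only for this illustration.

```python
def firing_squad(neighbors, start_times, d, rounds):
    """Simulate MaxWave with clocks reset to 0 at start; return each entity's firing round."""
    n = len(neighbors)
    clocks = [None] * n  # None: the entity has not started, so its clock is not yet reset
    inbox = [[] for _ in range(n)]
    fired_at = [None] * n
    for now in range(rounds):
        outbox = [[] for _ in range(n)]
        for x in range(n):
            received = [v + 1 for v in inbox[x]]  # one time unit elapsed in transit
            if clocks[x] is None:
                if not received and start_times.get(x) != now:
                    continue  # still asleep
                clocks[x] = max([0] + received)  # reset to 0, then adopt the maximum
                send = True
            else:
                best = max(received, default=clocks[x])
                send = best > clocks[x]
                clocks[x] = max(clocks[x], best)
            if send:
                for y in neighbors[x]:
                    outbox[y].append(clocks[x])
            if clocks[x] == d and fired_at[x] is None:
                fired_at[x] = now  # the reset clock shows d: enter state firing
        clocks = [None if c is None else c + 1 for c in clocks]
        inbox = outbox
    return fired_at


# A path of five entities (d = 4): entity 1 becomes excited at time 0,
# and entity 4 independently at time 2.
path = [[1], [0, 2], [1, 3], [2, 4], [3]]
fired = firing_squad(path, {1: 0, 4: 2}, d=4, rounds=10)
```

In this run every entity fires in round 4, that is, exactly $d$ time units after the first entity started, even though entity 4 started later and independently.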

The bit complexity can be reduced at the expense of time, by using communicators to communicate the content of the messages (Exercises 6.6.66 and 6.6.67).

6.5 BIBLIOGRAPHICAL NOTES

Some of the work on synchronous computing was done very early on in the context of Cellular Automata and Systolic Arrays; in particular, pipeline is a common computational tool in VLSI systems (which include systolic arrays).

In the framework of distributed computing, the first important result on (fault-free) synchronous computations is protocol Speed, designed by Greg Frederickson and Nancy Lynch [9], and independently by Paul Vitanyi [26] (whose version of the protocol actually works with a weaker form of full synchrony, called Archimedean Time Assumption or ATA). This result alerted algorithmic researchers to the existence of the field. Some of the first improvements were due to Eli Gafni [11] and Alberto Marchetti-Spaccamela [17], who reduced the time but still kept the unbounded bit complexity. Subsequent improvements to bounded bit complexity and to reduced time costs were obtained by using (and combining) communicators, waiting, and guessing.

Communicators have been used for a while. The so-called "one-bit" protocol (e.g., see Problem 6.6.1) was originally proposed and used by Hagit Attiya, Marc Snir, and Manfred Warmuth [3] and later rediscovered by Amotz Bar-Noy, Joseph Naor, and Moni Naor [4]. The size communicator is due to Bernd Schmeltz [24]. C2 is "folk" knowledge, while C3 is due to Paul Vitanyi [unpublished]. The optimal k-communicators have been designed by Una-May O'Reilly and Nicola Santoro [20].

The first combined use of communicators and pipeline is due to B. Schmeltz [24]. The computations in trees using pipeline are due to Paola Flocchini [8]. The asynchronous-to-synchronous transform is due to Una-May O'Reilly and Nicola Santoro [19].
The waiting technique was independently discovered by Eli Gafni [11], who used it to reduce the time costs of Speed, and by Nicola Santoro and Doron Rotem [23], who designed protocol Wait. Protocol Guess has been designed by Jan van Leeuwen, Nicola Santoro, Jorge Urrutia, and Shmuel Zaks [16]. Double Waiting is due to Mark Overmars and Nicola Santoro [21].


The first bit-optimal election protocol for rings is due to Hans Bodlaender and Gerard Tel [5]; it does, however, require exponential time. The time has been subsequently drastically reduced (Problem 6.6.9) without increasing the bit complexity by Mark Overmars and Nicola Santoro [21].

The problem of symmetry breaking was first studied for rings by Alon Itai and Michael Rodeh [14] and for other networks by Doron Rotem and Nicola Santoro [23]. The simpler and more efficient protocol Symmetry has been designed by Greg Frederickson and Nicola Santoro [10]. These results have been extended to environments with ATA-synchrony by Paul Spirakis and Basil Tampakas [25].

The maximum-finding protocol for rings of Problem 6.6.7 has been designed by Paola Alimonti, Paola Flocchini, and Nicola Santoro [1]. The trade-offs for wake-up in complete graphs with chordal labeling are due to Amos Israeli, Evangelos Kranakis, Danny Krizanc, and Nicola Santoro [13].

The unison problem was first studied (in a slightly different context) by Shimon Even and Sergio Rajsbaum [6, 7], and in the context of self-stabilization by Mohamed Gouda and Ted Herman [12]. Bounding the message size was studied by Anish Arora, Shlomi Dolev, and Mohamed Gouda [2], again in the context of self-stabilization.

The firing squad problem was originally proposed for Cellular Automata by J. Myhill and reported by E. Moore [18]. In our context, the problem was first studied for synchronous trees by Raul Ramirez and Nicola Santoro [22]; the optimal solution has been designed by Ephraim Korach, Doron Rotem, and Nicola Santoro [15]. The universal protocol MaxWave is a simple extension of existing unison solutions.

6.6 EXERCISES, PROBLEMS, AND ANSWERS

6.6.1 Exercises

Exercise 6.6.1 Determine the number of messages of protocol Speed if the waiting function is $f(v) = c^v$, for an integer $c > 2$.

Exercise 6.6.2 Determine the number of messages of protocol Speed if the waiting function is $f(v) = v^c$, for an integer $c > 1$.

Exercise 6.6.3 Modify protocol Speed so that even if the entities do not start simultaneously, a leader is elected with $O(n)$ messages.

Exercise 6.6.4 Prove that Protocol Speed requires $2^i n$ time units.

Exercise 6.6.5 Modify protocol C2 so that it communicates any integer $i$, positive or negative, transmitting 2 bits and $O(|i|)$ time units.

Exercise 6.6.6 Construct a protocol R2 that communicates any positive integer $I$ transmitting 2 bits and only $2 + \frac{I}{4}$ time units.


Exercise 6.6.7 Consider protocol TwoBits when each packet contains $c > 1$ bits. Use the content of the packets to convey information about the value $i$ to be communicated. Determine the time costs that can be achieved.

Exercise 6.6.8 Construct a protocol R3 that communicates any positive integer $I$ transmitting 3 bits and only $\sqrt{I} + 3$ time units.

Exercise 6.6.9 Consider a system where packets contain $c > 1$ bits. Modify protocol R3 using the content of the packets so as to reduce the time costs. Determine the amount of savings that can be achieved.

Exercise 6.6.10 Prove that the communicator described in Section 6.2.1 uses at most $O(i^{1/k})$ time units.

Exercise 6.6.11 Use the content of the transmitted bits so as to reduce the time costs of the communicator described in Section 6.2.1. Show how a time cost of at most $(k-1)(I/4)^{\frac{1}{k-1}} + k$ clock ticks can be achieved.

Exercise 6.6.12 Prove that communicator Orderk uses $f(I, k) + k + 1$ time to communicate $I$, where $f(I, k)$ is the smallest integer $t$ such that $I \le \binom{t+k}{k}$.

Exercise 6.6.13 Prove that communicator Order+k uses $g(I, k) + k + 1$ time to communicate $I$, where $g(I, k)$ is the smallest integer $t$ such that $I \le 2^{k+1}\binom{t+k}{k}$.

Exercise 6.6.14 Prove that $\omega(t, k) = \binom{t+q}{q}$.

Exercise 6.6.15 Prove that any protocol using k + 1 corruptible bits to communicate values from U requires

f |U |, k

2

|U | −

i

0≤i 2; thus, they cannot be used in pipeline for computing the minimum. Determine a class MonotoneOrderk of optimal corruption-tolerant communicators that are monotonically increasing.

Exercise 6.6.22 Communicators Order+k are optimal but not monotonically increasing for $k > 2$; thus, they cannot be used in pipeline for computing the minimum. Determine a class MonotoneOrder+k of optimal communicators that are monotonically increasing.

Exercise 6.6.23 Write a protocol for finding the largest value in a chain using the 2-bit communicator and pipeline. Prove its correctness.

Exercise 6.6.24 Minimum-Finding in Pipeline. Write a protocol for finding the smallest value in a chain using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Exercise 6.6.25 Sum-Finding in Pipeline. Write a protocol for finding the sum of all the values in a chain using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Exercise 6.6.26 Protocol SynchStages is the transformation of Stages using communicator TwoBits. Add pipeline to this protocol to convey information from a candidate to a neighboring one. Prove its correctness. Analyze its costs; in particular, determine the reduction in time with respect to the nonpipelined version.

Exercise 6.6.27 Modify protocol Wait so that it finds the minimum value only among the initiators.


Exercise 6.6.28 Determine the smallest waiting function that allows protocol Wait to work correctly without simultaneous initiation: (a) in a unidirectional ring; (b) in a bidirectional ring.

Exercise 6.6.29 Determine the smallest waiting function that allows protocol Wait to work correctly with simultaneous initiation: (1) in an $a \times b$ mesh; (2) in an $a \times b$ torus; (3) in a $k$-dimensional hypercube; (4) in a complete network.

Exercise 6.6.30 Determine the smallest waiting function that allows protocol Wait to work correctly without simultaneous initiation: (1) in an $a \times b$ mesh; (2) in an $a \times b$ torus; (3) in a $k$-dimensional hypercube.

Exercise 6.6.31 Prove that protocol Wait would work even if a quantity $n' \ge n$ is used instead of $n$.

Exercise 6.6.32 Determine under what conditions protocol Wait would work if a quantity $n' > n$ is used instead of $n$ in the waiting function.

Exercise 6.6.33 Assuming distinct initial values, characterize what would happen to protocol Wait in a ring network if each entity $x$ uses $2\,id(x)^2$ as its waiting function. In particular, determine under what conditions the protocol would certainly work.

Exercise 6.6.34 Under the conditions of Exercise 6.6.33, show how all the entities can efficiently detect whether the protocol does not work.

Exercise 6.6.35 Determine the cost of computing the AND of all input values in a synchronous ring of known size $n$ using protocol Waiting.

Exercise 6.6.36 Describe how to efficiently use protocol Wait to compute the OR of the input values in a synchronous ring of known size $n$. Determine its cost.

Exercise 6.6.37 Modify protocol Symmetry so that it works efficiently in a bidirectional square torus of known dimension. Determine its exact costs.

Exercise 6.6.38 Modify protocol Symmetry so that it works efficiently in a unidirectional square torus of known dimension. Determine its costs.

Exercise 6.6.39 Prove that with simultaneous initiation, protocol Symmetry can be modified so as to work correctly in every network of known girth. (Hint: Use the girth instead of $n$ in the waiting function.)

Exercise 6.6.40 Determine the complexity of protocol Symmetry if we use in the random selection criteria $b = n$ and choose each value with the same probability $\frac{1}{n}$.


Exercise 6.6.41 Modify protocol Decide so as to compute the OR of the input values in a synchronous ring of known size $n$. Prove its correctness and determine its cost.

Exercise 6.6.42 Write protocol Guess and implement it; thoroughly test your implementation.

Exercise 6.6.43 Show how to find $i_{\min}$ with $k$ overestimates using $q = k\,M^{1/k}$ questions.

Exercise 6.6.44 Show how we can always win the guessing game in an interval of size $2^q - 1$ with $q$ questions if they are all allowed to be overestimates.

Exercise 6.6.45 Show how to obtain a unique solution to the recurrence relation of Expression 6.33.

Exercise 6.6.46 Determine a function $g$ to bound $i_{\min}$ so that the total time for finding $i_{\min}$ with $k$ overestimates is at most $2\,h(i_{\min}, k) - 1$.

Exercise 6.6.47 Modify subprotocol Decide(p) so that it will work in every network, regardless of its topology. Assume that an upperbound on the diameter of the network is known a priori. Prove its correctness.

Exercise 6.6.48 Modify subprotocol Decide(p) so that protocol Guess works correctly even if the entities do not start simultaneously.

Exercise 6.6.49 Prove that, in DoubleWait, if $x$ is being "fooled," then both the "Wait1" and the "Wait2" message it receives are sent by the same entity.

Exercise 6.6.50 Let the entities start the $j$th iteration of DoubleWait within $n - 1$ time units from each other. Prove that the entity with the smallest value becomes leader and all others will become defeated in that iteration.

Exercise 6.6.51 Let the entities start the $j$th iteration of DoubleWait within $n - 1$ time units from each other. Prove that if an entity $x$ becomes leader in this iteration, then $g(j) \ge n > g(j-1)$.

Exercise 6.6.52 Let the entities start the $j$th iteration of DoubleWait within $n - 1$ time units from each other. Prove that if $g(j) < n$, then all entities start the $(j+1)$th iteration within $n - 1$ time units from each other.

Exercise 6.6.53 Prove that the time used by protocol DoubleWait, with the choices of $f$ and $h$ specified by Expressions 6.40 and 6.41, is at most $2(n-1) + (4\,i_{\min} + 2)\sum_{j=1}^{\hat{\jmath}} g(j)$.


Exercise 6.6.54 Consider protocol DoubleWait, where $f$ and $h$ are as in Expressions 6.40 and 6.41, and $g$ is superincreasing. Prove that the time is at most $2(n-1) + (8\,i_{\min} + 2)\,g(g^{-1}(n))$.

Exercise 6.6.55 Consider protocol DoubleWait, where $f$ and $h$ are as in Expressions 6.40 and 6.41. Determine the number of bits if the time is $O(n \log n \cdot i)$.

Exercise 6.6.56 () Determine whether or not there is a choice of $g$ that makes DoubleWait more efficient than SynchStages in both time and bits.

Exercise 6.6.57 Let $L = (a_1, b_1), \ldots, (a_k, b_k)$ be $k$ pairs of distinct labels $a_i, b_i \in \{1, \ldots, n\}$. Consider now a complete network of $n$ nodes; in this network, select $2k + 1$ nodes $x_0, x_1, \ldots, x_{2k}$. Show that it is always possible
1. to label the links between these nodes only with pairs from $L$ (e.g., the link $(x_0, x_1)$ will be labeled $a_3$ at $x_0$ and $b_3$ at $x_1$), and
2. to label all other links in the network with labels in $\{1, \ldots, n\}$ without violating local orientation anywhere.

Exercise 6.6.58 Consider exactly the same question as in Exercise 6.6.57, where, however, $n$ is even and exactly one pair in $L$, say $(a_1, b_1)$, is composed of identical labels, i.e., $a_1 = b_1$.

Exercise 6.6.59 Prove that in protocol MaxWave, the largest of the local clock values (when the execution starts) will reach (properly increased) every entity, and each entity will set its local clock to such a (properly increased) time value.

Exercise 6.6.60 Consider protocol MaxWave when the entities do not necessarily start at the same time, and let $d$ be known. Let $t$ be the (global) time the first entities start the execution of the protocol and let $t(x) \ge t$ be the global time when $x$ starts. Modify the protocol so that (even though $x$ does not know $t$) at time $t + 2d$ it knows for sure that the system is operating in unison.

Exercise 6.6.61 Determine the message cost of protocol MaxWave (a) in a unidirectional ring; (b) in a bidirectional ring. You may assume that $n$ is known.

Exercise 6.6.62 Determine the message cost of protocol MaxWave in a $k$-dimensional hypercube.

Exercise 6.6.63 Determine the worst-case and average-case message costs of protocol MaxWave in a tree network.


Exercise 6.6.64 Let, in protocol MaxWave, each entity reset its local clock to 0 when it starts the protocol. Prove that in this way, the maximum value transmitted is at most 2d.

Exercise 6.6.65 Consider the unison protocol MinWave where, instead of setting the clocks to and propagating the largest value, we set the clock to and propagate the smallest value. Prove correctness, termination, and costs of protocol MinWave.

Exercise 6.6.66 Determine the bit and time costs of protocol MaxWave if the content of a message is communicated using the 2-bit communicator.

Exercise 6.6.67 Determine the bit and time costs of protocol MaxWave if the content of a message is communicated using a k-bit communicator.

Exercise 6.6.68 Show how to solve the firing squad problem on a tree using at most 4n − 4 messages, each containing a value of at most d, and in time at most 3d − 3.

Exercise 6.6.69 () Show how to solve the firing squad problem on a tree using only O(n) bits in O(d) time.

Exercise 6.6.70 In protocol MaxWave, let a message originated by an initiator reach another entity y at time t + w. Prove that the value of that message (incremented by 1) is exactly w.

Exercise 6.6.71 In protocol MaxWave, let a message originated by an initiator reach another entity y at time t + w. Prove that regardless of whether y has already independently started or starts now, the current value of its reset local clock will be smaller than w; thus, y will set its clock in unison with the clocks of the initiators.

6.6.2 Problems

Problem 6.6.1 (OneBit Protocol) Determine under what conditions information can be communicated using only 1 bit and describe the corresponding OneBit protocol.

Problem 6.6.2 (BitPattern Communicator) Consider the class of communicators that use a bit set to 1 to denote termination. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.3 (2-BitPattern Communicator) () Consider the class of communicators that use two successive transmissions of 1 to denote termination. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.4 (Size Communicator) Consider the class of communicators that use the first quantum to communicate the total number of bits that will be transmitted. Determine the minimum cost that can be achieved and design the corresponding protocol.

Problem 6.6.5 (Pipeline in Trees: Max) Write the protocol for finding the maximum of all the values in a tree using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Problem 6.6.6 (Pipeline in Trees: Min) Write the protocol for finding the minimum of all the values in a tree using the 2-bit communicator and pipeline. Prove its correctness. Determine its costs.

Problem 6.6.7 (Maximum Finding I) () Consider a ring of known size n. Each entity has a positive integer value; they all start at the same time, but their values are not necessarily distinct. The maximum-finding problem is the one of having all the entities with the largest value become maximum and all the others small. Design a protocol to solve the maximum-finding problem in time linear in $i_{max}$ using at most O(n log n) bits.

Problem 6.6.8 (Maximum Finding II) () Determine whether the maximum-finding problem in a ring of known size can be solved in time linear in $i_{max}$ with O(n) bits.

Problem 6.6.9 (Bit-Optimal Election I) () Show how to elect a leader in a ring with only O(n) bits without knowing n. Possibly the time should be polynomial in i or exponential in n. (Hint: Use a single iteration of DoubleWait as a preprocessing phase.)

Problem 6.6.10 (Bit-Optimal Election II) () Determine whether or not it is possible to elect a leader without knowing n with Θ(n) bits in time sublinear in i, that is, to match the complexity achievable when n is known.

Problem 6.6.11 (Unison without knowing d) () Consider the unison problem when there is no known upper bound on the diameter d of the network. Prove or disprove that in this case the unison problem cannot be solved with explicit termination.

Problem 6.6.12 (Firing in a Line of CA with 6 States) () Finite cellular automata (CA) can only have a constant memory size, which means they cannot store a counter. The goal is thus to solve the firing squad problem in the least amount of time and with the least amount of memory. The measure we use for the memory is the maximum number of different values that can be stored in the memory, and it is called the number of states of the automaton. Consider a line of CA with only one initiator (located at the end of the line). Develop a solution that uses only six states.

Problem 6.6.13 (Firing in a Line of CA with 5 States) () Consider a line of CA with only one initiator (located at the end of the line). Develop a solution using only five states or prove that it cannot be done.

6.6.3 Answers to Exercises

Answer to Exercise 6.6.4 Consider the entity x that will become leader. It did spontaneously initiate the protocol; its message traveled along the ring at the speed of $f(i_x) + 1 = 2 i_x + 1$, where $i_x$ is the input value of x; hence, its message returned after $(n - 1)(2 i_x + 1)$ time units; another n time units are required for the notification message.

Answer to Exercise 6.6.6 Let

$$b_0 = \begin{cases} 0 & \text{if } I \text{ is even} \\ 1 & \text{if } I \text{ is odd.} \end{cases}$$

If we were to encode I in the sequence $b_1 \mid \lfloor I/2 \rfloor \mid b_0$, the receiver can reconstruct I using as a decoding function $decode(b_0 \mid q_1 \mid b_1) = 2 q_1 + b_0$, where $b_0$ is used as an integer value. In this way, we have effectively cut the quantum of time in half: The waiting time becomes $2 + \lfloor I/2 \rfloor$. It can actually be reduced further. Let

$$b_1 = \begin{cases} 0 & \text{if } \lfloor I/2 \rfloor \text{ is even} \\ 1 & \text{if } \lfloor I/2 \rfloor \text{ is odd.} \end{cases}$$

If we were to encode I in the sequence $b_1 \mid \lfloor I/2^2 \rfloor \mid b_0$, the receiver can reconstruct I using as a decoding function $decode(b_0 \mid q_1 \mid b_1) = 2(2 q_1 + b_1) + b_0$, where both $b_0$ and $b_1$ are treated as integer values. The waiting time then becomes $2 + \lfloor I/4 \rfloor$.

Answer to Exercise 6.6.6 Consider the following communicator $R_3$: The first bit, $b_0$, is used to indicate whether $y = \lfloor \sqrt{I} \rfloor$ is odd; the second bit, $b_1$, is used to indicate whether $z = I - \lfloor \sqrt{I} \rfloor^2$ is odd; the third bit, $b_2$, is used to indicate whether $w = \lfloor z/2 \rfloor$ is odd. The two quanta waited are $q_1 = \lfloor y/2 \rfloor$ and $q_2 = \lfloor w/2 \rfloor$. To obtain I, the receiver simply computes $(2 q_1 + b_0)^2 + (4 q_2 + 2 b_2 + b_1)$, where the bits are treated as integer values. For example, if I = 7387, we have y = 85, z = 162, and w = 81; thus, the two quanta are $q_1 = 42$ and $q_2 = 40$, while the bits are $b_0 = 1$, $b_1 = 0$, and $b_2 = 1$. The quantity $(2 q_1 + b_0)^2 + (4 q_2 + 2 b_2 + b_1) = 85^2 + 162$ computed by the receiver is indeed I. Notice that $q_1 = \lfloor y/2 \rfloor \le \sqrt{I}/2$ and, as $z \le 2\sqrt{I}$, $q_2 = \lfloor w/2 \rfloor \le \sqrt{I}/2$; thus, this protocol has time-bits complexity at most $\langle 3, \sqrt{I} + 3 \rangle$.

The protocol is correct (Exercise 6.6.11). Exactly k − 1 quanta will be used, and k bits will be transmitted. It is easy to verify that $I_{2i+1} \le \sqrt{I_i}$; since $I_{2i} = \lfloor \sqrt{I_i} \rfloor$ by definition, it follows that each quantum is at most $(I/4)^{1/(k-1)}$. Hence, the time complexity is at most

$$(k - 1)\,(I/4)^{\frac{1}{k-1}} + k.$$

Partial Answer to Exercise 6.6.11 The encoding of I can be defined recursively as follows: $E(I) = b_0 \mid E(I_1) \mid b_{k-1}$, where

$$E(I_i) = \begin{cases} E(I_{2i}) \mid b_i \mid E(I_{2i+1}) & \text{if } 1 \le i < k - 1 \\ \text{quantum of length } I_i & \text{if } k - 1 \le i \le 2k - 3, \end{cases}$$

$$I_1 = \left\lfloor \frac{I}{2^2} \right\rfloor, \quad I_{2i} = \left\lfloor \sqrt{I_i} \right\rfloor, \quad I_{2i+1} = \left\lfloor \frac{I_i - I_{2i}^2}{2} \right\rfloor,$$

$$b_i = (I_i - I_{2i}^2) \bmod 2, \quad \text{and} \quad b_{k-1} = \left\lfloor \frac{I}{2} \right\rfloor \bmod 2.$$

To obtain I, the receiver will recursively compute $I_i = I_{2i}^2 + (2 I_{2i+1} + b_i)$ until $I_1$ is determined; then, $I = 4 I_1 + 2 b_{k-1} + b_0$.

Answer to Exercise 6.6.14 We want to prove that $\omega(t, k) = \binom{t+q}{q}$. Let w = ω(t, k); by definition, it must be possible to communicate any element in $Z_w = \{0, 1, \ldots, w\}$ using q = k − 1 distinguished quanta requiring at most time t. In other words, ω(t, q + 1) is equal to the number of distinct q-tuples $\langle t_1, t_2, \ldots, t_q \rangle$ of positive integers such that $\sum_{1 \le i \le q} t_i \le t$. Given a positive integer x, let $T_q[x]$ denote the number of compositions of x of size q, that is,

$$T_q[x] = \left| \left\{ \langle x_1, x_2, \ldots, x_q \rangle : \sum_j x_j = x,\ x_j \in Z^+ \right\} \right|.$$

As $T_q[x] = \binom{x+q-1}{q-1}$, it follows that

$$\omega(t, q + 1) = \sum_{0 \le i \le t} T_q[i] = \sum_{0 \le i \le t} \binom{i+q-1}{q-1} = \binom{t+q}{q},$$

which proves that $\omega(t, k) = \binom{t+q}{q}$.
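The counting step in the answer to Exercise 6.6.14 can be checked numerically for small values. The sketch below is only an illustration, not part of the original answer: the function name `count_tuples` is hypothetical, and it assumes that a quantum may have length zero (i.e., the tuples range over non-negative integers), which is the reading under which the binomial count $\binom{x+q-1}{q-1}$ of compositions holds.

```python
from itertools import product
from math import comb


def count_tuples(t: int, q: int) -> int:
    """Brute-force count of q-tuples of non-negative integers with sum <= t.

    Assumption (not stated explicitly in the text): a quantum may be 0.
    """
    return sum(1 for tup in product(range(t + 1), repeat=q) if sum(tup) <= t)


# Compare the brute-force count with the closed form C(t + q, q)
# for a few small values of t (total time) and q (number of quanta).
for t in range(6):
    for q in range(1, 4):
        assert count_tuples(t, q) == comb(t + q, q)
```

For instance, with t = 3 and q = 2 there are 1 + 2 + 3 + 4 = 10 pairs, matching $\binom{5}{2} = 10$.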

Answer to Exercise 6.6.15 Let $f(|U|, q) = t$. First of all we prove that for any solution protocol P for $C_{q+1}(U)$, there exists a partition of U into t + 1 disjoint subsets $U_0, U_1, \ldots, U_t$, such that
1. $|U_i| = \binom{i+q-1}{q-1}$, $0 \le i < t$, and $|U_t| \le \binom{t+q-1}{q-1}$,
2. the time P(x) required by P to communicate $x \in U_i$ is $P(x) \ge i$.

As $f(|U|, q) = t$, by Equation 6.9, U is the largest set for which the two-party communication problem can always be solved using b = q + 1 transmissions and at most t additional time units. Given a protocol P for $C_{q+1}(U)$, order the elements $x \in U$ according to the time P(x) required by P to communicate them; let $\ddot{U}$ be the corresponding ordered set. Define $\ddot{U}_i$ to be the subset composed of the elements of $\ddot{U}$ whose ranking, with respect to the ordering defined above, is in the range from $\sum_{0 \le j < i} \binom{j+q-1}{q-1}$ to $\sum_{0 \le j \le i} \binom{j+q-1}{q-1}$. As $f(|U|, q) = t$, it follows [...]. In other words, in protocol P, the number of elements that are uniquely identified using q quanta for a total of j time is greater than the number $T_q[j] = \binom{j+q-1}{q-1}$ of compositions of j of size q: a clear contradiction. Hence, for every $x \in \ddot{U}_i$, $P(x) \ge i$, proving part 2. At this point, the rest of the proof easily follows.

Answer to Exercise 6.6.17 The number of distinct assignments of values to q + 1 distinguished bits is $2^{q+1}$. The number of distinct q-tuples $\langle t_1, t_2, \ldots, t_q \rangle$ of positive integers such that $\sum_j t_j \le t$ is ω(t, k) (from 6.9). Therefore,

$$\beta(t, k) = 2^{q+1}\, \omega(t, k) = 2^{q+1} \binom{t+q}{q}.$$

Partial answer to Exercise 6.6.19 First prove the following: Let $\mu(|U|, q) = t$; for any solution protocol P using k reliable bits to communicate values from U, there exists a partition of U into t + 1 disjoint subsets $U_0, U_1, \ldots, U_t$, such that
1. $|U_i| = 2^{q+1} \binom{i+q-1}{q-1}$, $0 \le i < t$, and $|U_t| = 2^{q+1} \binom{t+q-1}{q-1}$,
2. the time P(x) required by P to communicate $x \in U_i$ is $P(x) \ge i$.

Then the rest of the proof easily follows.

Answer to Exercise 6.6.44 Hint: Use binary search.

Answer to Exercise 6.6.49 Let x be fooled and incorrectly become leader at the end of the jth iteration. According to the algorithm, the only way for x to become leader is the following:
1. At time t(x, j), x starts waiting for f(x, j). Note that during this time x must not receive any message if it is to become a leader later.
2. At time t(x, j) + f(x, j), x sends a “Wait1” message and becomes checking.
3. At time t(x, j) + f(x, j) + $n_x(j)$, it receives a “Wait1” message and starts the second waiting. Note that during this time, x must not receive any message if it is to become a leader later.
4. At time t(x, j) + f(x, j) + $n_x(j)$ + h(x, j), it sends a “Wait2” message and becomes checking-again.
5. At time t(x, j) + f(x, j) + g(x, j) + 2$n_x(j)$, it receives a “Wait2” message and becomes leader.

Let y ≠ x and z ≠ x be the entities that originated the “Wait1” and “Wait2” messages, respectively, received by x. Notice that to originate these messages, y and z cannot be passive (they might become so later, though). The “Wait1” message is sent by y only after it has successfully finished waiting for f(y, j) time units. That is, the “Wait1” message will be sent by y at time t(y, j) + f(y, j). This message requires d(y, x) time units to reach x. Therefore, t(x, j) + f(x, j) + m(x, j) = t(y, j) + f(y, j) + d(y, x). The “Wait2” message will arrive at x at time t(x, j) + f(x, j) + 2m(x, j) + g(x, j).

By contradiction, let z ≠ y. Consider first the case when y is located on the path from z to x. In this case, the “Wait2” message originated by z will reach y before x. If y is still waiting to receive a “Wait1” message, the reception of this “Wait2” message will alert it that something is wrong; it will not forward the “Wait2” message to x and will send a “Restart” instead, and thus, x will not become leader.
Therefore, z is located on the path from y to x. In this case, the “Wait1” message originated by y reaches z before arriving at x. As we have assumed that this message will arrive at x, it means that z must have forwarded it; the only way it could have done so is by becoming passive, but in this case z will not originate a “Wait2” message, contradicting the assumption we have made.

Answer to Exercise 6.6.50 Let x be the entity with the smallest id, and denote this value by i. Entity x will start at time t(x, j) and would stop waiting at time t(x, j) + f(x, j). As the entities start the iteration within n − 1 time units from each other, for every other entity y, t(x, j) − t(y, j) ≤ n − 1; as d(x, y) ≤ n − 1, this means that t(x, j) + f(x, j) + d(x, y) ≤ f(x, j) + 2(n − 1). Recall that f is a waiting function; this means that as x has the smallest identity and g(j) ≥ n, f(x, j) + 2(n − 1) < f(y, j) for every other entity y. Thus, t(x, j) + f(x, j) + d(x, y) < f(y, j). That is, x will finish waiting before anybody else; its message will travel along the ring, transforming all other entities into passive, and will reach x after $n_x = n$ time units. Thus, x will be the only entity starting the second waiting, and its “Wait2” message will reach x again after $n_x = n$ time units. Hence, x will validate its guess, become leader, and notify all other entities of termination.

Answer to Exercise 6.6.52 We know (Exercises 6.6.50 and 6.6.51) that if n ∉ ∂(j − 1), then no entity becomes a leader in the (j − 1)th iteration. According to the leader election algorithm, if an entity becomes neither leader nor passive during the (j − 1)th iteration, it becomes active and unconditionally sends an R message for the jth iteration. At this point the jth iteration starts with bounded delays. The proof of this lemma is based on showing that it is impossible for all the entities to become passive in the (j − 1)th iteration, in which case no leader would be elected and there would be no active entity to send the R message.
First, let x be the entity with the smallest $i_x$, called i, and let all the entities become passive in the (j − 1)th iteration. Note that, according to the algorithm, the only way for an entity to become passive is to receive a C message while it is in the waiting state; that is, x must receive a C message during f(x, j − 1) in order to become passive. Let y denote the entity that originates the C message. The C message will arrive at x at time t(y, j − 1) + f(y, j − 1) + d(y, x). Thus, in order for x to become passive, it must be that

$$t(x, j-1) + f(x, j-1) > t(y, j-1) + f(y, j-1) + d(y, x),$$

$$t(x, j-1) + i\,(b_{j-1} + 1) > t(y, j-1) + i_y\,(b_{j-1} + 1) + d(y, x).$$

As i is the smallest value, $i < i_y$ and, therefore, $i\,(b_{j-1} + 1) < i_y\,(b_{j-1} + 1)$. Then, for the last inequality to hold, it must be that t(x, j − 1) > t(y, j − 1) + d(y, x), contradicting the fact that all the entities start the (j − 1)th iteration with bounded delay. Therefore, it is impossible that all the entities become passive in any iteration. In conclusion, if n ∉ ∂(j − 1), an R message is sent by an active entity and the next iteration starts with bounded delays, thus proving Lemma 3.

Answer to Exercise 6.6.53 Let x be the entity with the smallest value, and let i be that value. Entity x starts executing the protocol at most n − 1 time units after the other entities. It starts the (j + 1)th iteration less than f(x, j) + 2$n_x(j)$ + h(x, j) time units after it started the jth iteration. As f(x, j) + g(x, j) + 2$n_x(j)$ = 2g(j)i + 2g(j)i + g(j) − $n_x(j)$ + 2$n_x(j)$ = (4i + 1)g(j) + $n_x(j)$, the total time required until x becomes leader (in iteration j) is at most

$$n - 1 + \sum_{l=1}^{j} \big( (4i + 1)\, g(l) + n_x(l) \big).$$

As there are also the n − 1 time units before the “Terminate” message notifies all entities, the total time for the algorithm is at most

$$2(n - 1) + \sum_{l=1}^{j} \big( (4i + 1)\, g(l) + n_x(l) \big).$$

Notice that if g(j) < $n_x(j)$, then x would detect the anomaly and send a “Restart”; thus, we can assume that in the expression above the actual time spent is Min{g(j), $n_x(j)$}. Then the above expression becomes $2(n - 1) + (4i + 2) \sum_{l=1}^{j} g(l)$.

Answer to Exercise 6.6.54 The last iteration is $j = g^{-1}(n)$; as g is superincreasing, $g(j) \ge \sum_{l=1}^{j-1} g(l)$. The algorithm terminates in less than $2(n - 1) + (4 i_{min} + 2) \sum_{l=1}^{j} g(l)$ time units. Now, $(4 i_{min} + 2) \sum_{l=1}^{j} g(l) \le 2(n - 1) + (4 i_{min} + 2)\, 2 g(j)$.

Answer to Exercise 6.6.60 Sketch: Use a counter, initially set to 0; in each step, set it to the largest of the received counters increased by one and add it to any message sent in that step. When the counter is equal to 2d, stop.
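The sketch for Exercise 6.6.60 can be explored with a small simulation. The code below is one possible reading of that sketch, not the protocol of the text: the function name `simulate_counter_unison`, the rule that an awake entity's own counter also advances by one per clock tick, the choice to send the counter to all neighbors at every step, and the wake-up of a sleeping entity upon reception are all assumptions made for illustration. Unit-delay synchronous links and integer start times are assumed as well.

```python
def simulate_counter_unison(adj, start_time, d):
    """Simulate a counter-based unison sketch on a synchronous network.

    adj:        dict mapping each node to the list of its neighbors
    start_time: dict mapping each node to its spontaneous (integer) start time
    d:          known upper bound on the diameter
    Returns a dict mapping each node to the global time at which its
    counter reaches 2d.
    """
    counter = {}                      # counters of the entities already awake
    stop_time = {}
    inbox = {v: [] for v in adj}      # counters arriving at the current step
    t = min(start_time.values())
    while len(stop_time) < len(adj):
        outbox = {v: [] for v in adj}
        for v in adj:
            if v in stop_time:
                continue
            received = inbox[v]
            if v in counter:
                # Awake: clock tick, or largest received counter plus one.
                counter[v] = max([counter[v] + 1] + [c + 1 for c in received])
            elif received:
                # Asleep and woken up by a message (assumed behavior).
                counter[v] = max(received) + 1
            elif start_time[v] <= t:
                # Spontaneous start: counter initially 0.
                counter[v] = 0
            else:
                continue
            if counter[v] >= 2 * d:
                stop_time[v] = t
            else:
                for u in adj[v]:
                    outbox[u].append(counter[v])
        inbox = outbox                # messages take one time unit
        t += 1
    return stop_time


# A line of five entities (diameter 4) with different start times.
line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
starts = {0: 3, 1: 0, 2: 5, 3: 9, 4: 2}
stops = simulate_counter_unison(line, starts, d=4)
```

Under these assumptions, in this run the first entity starts at time 0 and every entity reaches counter value 2d = 8 at global time 8, that is, at time t + 2d.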

Answer to Exercise 6.6.68 Use saturation: Each of the two saturated nodes computes its eccentricity; the largest of the two is communicated to their subtrees, starting a “countdown.” When the furthermost entity receives the message, the values of all entities become 0 simultaneously, and they all enter state firing at the same time. This protocol uses at most 3n − 2 signals for the wake-up and saturation and an additional n − 2 messages for the countdown, each containing a value of at most d. The time is at most 2d for wake-up and saturation; at most d additional time units are needed for the countdown.


CHAPTER 7

Computing in Presence of Faults

7.1 INTRODUCTION

In all previous chapters, with few exceptions, we have assumed total reliability, that is, that the system is failure free. Unfortunately, total reliability is practically nonexistent in real systems. In this chapter we will examine how to compute, if possible, when failures can and do occur.

7.1.1 Faults and Failures

We speak of a failure (or fault) whenever something happens in the system that deviates from the expected correct behavior. In distributed environments, failures and their causes can be very different in nature. In fact, a malfunction could be caused by a design error, a manufacturing error, a programming error, physical damage, deterioration in the course of time, harsh environmental conditions, unexpected inputs, operator error, cosmic radiation, and so forth. Not all faults lead (immediately) to computational errors (i.e., to incorrect results of the protocol), but some do. So the goal is to achieve fault-tolerant computations, that is, our aim is to design protocols that will proceed correctly in spite of the failures.

The unpredictability of the occurrence and nature of a fault and the possibility of multiple faults render the design of fault-tolerant distributed algorithms very difficult and complex, if at all possible. In particular, the more components (i.e., entities, links) are present in the system, the greater is the chance of one or more of them being or becoming faulty.

Depending on their cause, faults can be grouped into three general classes:
- execution failures, that is, faults occurring during the execution of the protocol by an entity; examples of execution failures are computational errors occurring when performing an action, as well as the execution of an incorrect rule;
- transmission failures, due to the incorrect functioning of the transmission subsystem; examples of transmission faults are the loss or corruption of a transmitted message as well as the delivery of a message to the wrong neighbor;
- component failures, such as the deactivation of a communication link between two neighbors, the shutdown of a processor (and thus of the corresponding entity), and so forth.

Design and Analysis of Distributed Algorithms, by Nicola Santoro Copyright © 2007 John Wiley & Sons, Inc.

Note that the same fault can occur because of different causes, and hence be classified differently. Consider, for example, a message that an entity x is supposed to send (according to the protocol) to a neighbor y but that never arrives. This fault could have been caused by x failing to execute the “send” operation in the protocol: an execution error; by the loss of the message by the transmission subsystem: a transmission error; or by the link (x, y) going down: a component failure.

Depending on their duration, faults are classified as transient or permanent. A transient fault occurs and then disappears of its own accord, usually within a short period of time. A bird flying through the beam of a microwave transmitter may cause lost bits on some network. A transient fault happens once in a while; it may or may not reoccur. If it continues to reoccur (not necessarily at regular intervals), the fault is said to be intermittent. A loose contact on a connector will often cause an intermittent fault. Intermittent faults are difficult to diagnose. A permanent failure is one that continues to exist until the fault is repaired. Burnt-out chips, software bugs, and disk head crashes often cause permanent faults.

Depending on their geographical “spread”, faults are classified as localized or ubiquitous. Localized faults always occur in the same region of the system, that is, only a fixed (although a priori unknown) set of entities/links will exhibit a faulty behavior. Ubiquitous faults will occur anywhere in the system, that is, all entities/links will exhibit a faulty behavior at some point or another. Note that usually transient failures are ubiquitous, while intermittent and permanent failures tend to be localized.

Clearly no protocol can be resilient to an arbitrary number of faults.
In particular, if the entire system collapses, no protocol can be correct. Hence, the goal is to design protocols that are able to withstand up to a certain number of faults of a given type.

Another fact to consider is that not all faults are equally dangerous. The danger of a fault lies not necessarily in the severity of the fault itself but rather in the consequences that its occurrence might have on the correct functioning of the system. In particular, danger for the system is intrinsically related to the notion of detectability. In general, if a fault is easily detected, a remedial action can be taken to limit or circumvent the damage; if a fault is hard or impossible to detect, the effects of the initial fault may spread throughout the network, possibly creating irreversible damage. For example, the permanent fault of a link going down forever is obviously more severe than a transient failure of that link. In contrast, the permanent failure of the link might be more easily detectable, and thus can be taken care of, than the occasional malfunctioning of the link. In this example, the less severe fault (the transient one) is potentially more dangerous for the system.

With this in mind, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated. To do so, we must first understand what the limits to the fault tolerance of a distributed computing environment are, expressed in terms of the nature and number of faults that make a nontrivial computation (im)possible.

7.1.2 Modeling Faults

Given the properties of the system and the types of faults assumed to occur, one would like to know the maximum number of faults that can be tolerated. This number is called the resiliency. To establish the resiliency, we need to be more precise on the types of faults that can occur. In particular, we need to develop a model to describe the failures in the system. Faults, as mentioned before, can be due to execution errors, transmission errors, or component failures; the same fault could be caused by any of those three causes and hence could be in any of these three categories. There are several failure models, each differing on what factor is “blamed” for a failure.

IMPORTANT. Each failure model offers a way of describing (some of the) faults that can occur in the system. A model is not reality, only an attempt to describe it.

Component Failure Models

The most common and best known models employed to discuss and study fault tolerance are the component failure models. In all the component failure models, the blame for any fault occurring in the system must be put on a component, that is, only components can fail, and if something goes wrong, it is because one of the involved components is faulty. Depending on which components are blamed, there are three types of component failure models: entity, link, and hybrid failure models.

In the entity failure (EF) model, only nodes can fail.
For example, if a node crashes, for whatever reason, that node will be declared faulty. In this model, a link going down will be modeled by declaring one of the two incident nodes to be faulty and to lose all the messages to and from its neighbor. Similarly, the corruption of a message during transmission must be blamed on one of the two incident nodes, which will be declared to be faulty.

In the link failure (LF) model, only links can fail. For example, the loss of a message over a link will lead to that link being declared faulty. In this model, the crash of a node is modeled by the crash of all its incident links. The event of an entity computing some incorrect information (because of an execution error) and sending it to a neighbor will be modeled by blaming the link connecting the entity to the neighbor; in particular, the link will be declared to be responsible for corrupting the content of the message.
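The way the EF and LF models assign blame for the same event can be made concrete with a small sketch. Everything in it is hypothetical and for illustration only: the names `Model`, `link`, and `blamed_components`, the event labels, and in particular the convention of blaming the first endpoint x in the EF model (the text only requires that one of the two incident nodes be blamed).

```python
from enum import Enum


class Model(Enum):
    EF = "entity failure"
    LF = "link failure"


def link(a, b):
    """An undirected link, represented as an unordered pair of nodes."""
    return frozenset((a, b))


def blamed_components(model, event, x, y=None, neighbors=()):
    """Return the set of components declared faulty for an event.

    event is 'node_crash' (of x), or one of 'link_down', 'message_lost',
    'message_corrupted' on the link (x, y).
    """
    if event == "node_crash":
        if model is Model.EF:
            return {x}
        # LF: a node crash is modeled by the crash of all its incident links.
        return {link(x, v) for v in neighbors}
    if event in ("link_down", "message_lost", "message_corrupted"):
        if model is Model.EF:
            # EF: one of the two incident nodes is blamed; picking x is an
            # arbitrary convention of this sketch.
            return {x}
        return {link(x, y)}
    raise ValueError(f"unknown event: {event}")


# The same physical event is described differently by the two models.
ef_view = blamed_components(Model.EF, "link_down", "x", "y")
lf_view = blamed_components(Model.LF, "link_down", "x", "y")
crash_lf = blamed_components(Model.LF, "node_crash", "x", neighbors=["a", "b"])
```

Here `ef_view` contains a node, `lf_view` contains the link (x, y), and `crash_lf` contains the two links incident on x.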

FIGURE 7.1: Hierarchy of faults in the EF model (Crash; Send Omission; Receive Omission; Send/Receive Omission; Byzantine).

In the hybrid failure (HF) model, both links and nodes can be faulty. Although more realistic, this model is little known and seldom used.

NOTE. In all three component failure models, the status faulty is permanent and is not changed, even though the faulty behavior attributed to that component may never be repeated. In other words, once a component is marked as faulty, that mark is never removed; so, for example, in the link failure model, if a message is lost on a link, that link will be considered faulty forever, even if no other message will ever be lost there.

Let us concentrate first on the entity failure model. That is, we focus on systems where (only) entities can fail. Within this environment, the nature of the failures of the entities can vary. With respect to the danger they may pose to the system, a hierarchy of failures can be identified.

1. With crash faults, a faulty entity works correctly according to the protocol, then suddenly just stops any activity (processing, sending, and receiving messages). These are also called fail-stop faults. Such a hard fault is actually the most benign from the overall system point of view.
2. With send/receive omission faults, a faulty entity occasionally loses some received messages or does not send some of the prepared messages. This type of fault may be caused by buffer overflows. Notice that crash faults are just a particular case of this type of failure: A crash is a send/receive omission in which all messages sent to and from that entity are lost. From the point of view of detectability, these faults are much more difficult than the previous one.
3. With Byzantine faults, a faulty entity is not bound by the protocol and can perform any action: It can omit to send or receive any message, send incorrect information to its neighbors, or behave maliciously so as to make the protocol fail. Undetected software bugs often exhibit Byzantine faults. Clearly, dealing with Byzantine faults is going to be much more difficult than dealing with the previous ones.

A similar hierarchy between faults exists in the link as well as in the hybrid failure models.

Communication Failures Model

A totally different model is the communication failure or dynamic fault (DF) model; in this model, the blame for any fault is put on the communication subsystem. More precisely, the communication system can lose messages, corrupt them, or deliver them to the wrong neighbor. As in this model only the communication system can be faulty, a component fault, such as the crash failure of a node, is modeled by the communication system losing all the messages sent to and from that node. Notice that in this model, no mark (permanent or otherwise) is assigned to any component.

In the communication failure model, the communication subsystem can cause only three types of faults:
1. An omission: A message sent by an entity is never delivered.
2. An addition: A message is delivered to an entity, although none was sent.
3. A corruption: A message is sent but one with different content is received.

While the nature of omissions and corruptions is quite obvious, that of additions is less so. Indeed, it describes a variety of situations. The most obvious one is when sudden noise in the transmission channel is mistaken for transmission of information by the neighbor at the other end of the link. A more important occurrence of additions in systems is rather subtle: an addition also models the reception of a “nonauthorized message” (i.e., a message not transmitted by any authorized user). In this sense, additions model messages surreptitiously inserted in the system by some outside, and possibly malicious, entity. Spam being sent from an unsuspecting site clearly fits the description of an addition.
Summarizing, additions do occur and can be very dangerous. These three types of faults are quite incomparable with each other in terms of danger. The hierarchy comes into place when two or all of these basic fault types can simultaneously occur in the system. The presence of all three types of faults creates what is called a Byzantine faulty behavior. The situation is depicted in Figure 7.2.

Clearly, no protocol can tolerate any number of faults of any type. If the entire system collapses, no computation is possible. Thus, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated.

Footnote 1. The term “Byzantine” refers to the Byzantine Empire (330–1453 AD), the long-lived eastern component of the Roman Empire whose capital city was Byzantium (now Istanbul), in which endless conspiracies, intrigue, and untruthfulness were alleged to be common among the ruling class.

[Figure 7.2: diagram whose nodes are the seven fault combinations: Omission, Addition, Corruption; Omission + Addition, Omission + Corruption, Addition + Corruption; Byzantine.]

FIGURE 7.2: Hierarchy of combinations of fault types in the DF model.
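
The combinations shown in Figure 7.2 can be enumerated mechanically. The following Python fragment is only a minimal sketch of that enumeration; the names `BASIC_FAULTS`, `fault_combinations`, `label`, and `at_least_as_dangerous` are illustrative and do not come from the text. It treats a fault model as the set of basic fault types that may occur, and it assumes that one model is at least as dangerous as another when it includes all the fault types of the other.

```python
from itertools import combinations

# The three basic fault types of the communication failure (DF) model.
BASIC_FAULTS = ("omission", "addition", "corruption")


def fault_combinations():
    """Return every non-empty combination of basic fault types, smallest first."""
    combos = []
    for size in range(1, len(BASIC_FAULTS) + 1):
        combos.extend(frozenset(c) for c in combinations(BASIC_FAULTS, size))
    return combos


def label(combo):
    """Name a combination; all three basic types together are called Byzantine."""
    if frozenset(combo) == frozenset(BASIC_FAULTS):
        return "Byzantine"
    return " + ".join(f.capitalize() for f in BASIC_FAULTS if f in combo)


def at_least_as_dangerous(a, b):
    """Assumed ordering: model a covers model b if it allows every fault type of b."""
    return frozenset(b) <= frozenset(a)


if __name__ == "__main__":
    for combo in fault_combinations():
        print(label(combo))
```

Under this ordering the three basic types are pairwise incomparable, while the Byzantine combination covers all the others.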

7.1.3 Topological Factors

Our goal is to design protocols that can withstand as many and as dangerous faults as possible and still exhibit a reasonable cost. What we will be able to do depends not only on our ability as designers but also on the inherent limits that the environment imposes. In particular, the impact of a fault, and thus our capacity to deal with it and design fault-tolerant protocols, depends not only on the type and number of faults but also on the communication topology of the system, that is, on the graph G. This is because all nontrivial computations are global, that is, they require the participation of possibly all entities. For this reason, Connectivity is a restriction required for all nontrivial computations. Even if the network is initially connected, faults occurring during the lifetime of the system may destroy connectivity, rendering correctness impossible. Hence, the capacity of the topological structure of the network to remain connected in spite of faults is crucial.

There are two parameters that directly link topology to reliability and fault tolerance:

edge connectivity c_edge(G) is the minimum number of edges whose removal destroys the (strong) connectivity of G;
node connectivity c_node(G) is the minimum number of nodes whose removal destroys the (strong) connectivity of G.

NOTE. In the case of a complete graph, the node connectivity is always defined as n − 1.

Clearly, the higher the connectivity, the higher the resilience of the system to component failures. In particular,

Property 7.1.1 If c_edge(G) = k, then for any pair x and y of nodes there are k edge-disjoint paths connecting x to y.

Network G      Node Connectivity c_node(G)   Edge Connectivity c_edge(G)   Max Degree deg(G)
Tree T         1                             1                             ≤ n − 1
Ring R         2                             2                             2
Torus Tr       4                             4                             4
Hypercube H    log n                         log n                         log n
Complete K     n − 1                         n − 1                         n − 1

FIGURE 7.3: Connectivity of some networks.

Property 7.1.2 If c_node(G) = k, then for any pair x and y of nodes there are k node-disjoint paths connecting x to y.

Let us consider some examples of connectivity. A tree T has the lowest connectivity of all undirected graphs: c_edge(T) = c_node(T) = 1, so the failure of a single link, or of a single internal node, suffices to disconnect the network. A ring R fares little better, as c_edge(R) = c_node(R) = 2. Higher connectivity can be found in denser graphs. For example, in a hypercube H, both connectivity parameters are log n. Clearly the highest connectivity is to be found in the complete network K. For a summary, see Figure 7.3. Note that in all connected networks G the node connectivity is not greater than the edge connectivity (Exercise 7.10.1) and neither can be better than the maximum degree:

Property 7.1.3 ∀G, c_node(G) ≤ c_edge(G) ≤ deg(G)

As an example of the impact of edge connectivity on the existence of fault-tolerant solutions, consider the broadcast problem Bcast.

Lemma 7.1.1 If k arbitrary links can crash, it is impossible to broadcast unless the network is (k + 1)-edge-connected.

Proof. If G is only k-edge-connected, then there are k edges whose removal disconnects G. The failure of those links will make some nodes unreachable from the initiator of the broadcast and, thus, they will never receive the information. By contrast, if G is (k + 1)-edge-connected, then even after k links go down, by Property 7.1.1, there is still a path from the initiator to all other nodes. Hence flooding will correctly work. ∎

As an example of the impact of node connectivity on the existence of fault-tolerant solutions, consider the problem of an initiator that wants to broadcast some information, but some of the entities may be down. In this case, we just want the nonfaulty entities to receive the information. Then (Exercise 7.10.2),

Lemma 7.1.2 If k arbitrary nodes can crash, it is impossible to broadcast to the nonfaulty nodes unless the network is (k + 1)-node-connected.
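
Lemma 7.1.1 can be checked by brute force on small networks. The sketch below is an illustration, not a protocol from the text: it assumes that the k links have already crashed when the broadcast starts, models flooding as a breadth-first search over the surviving links, and tries every set of k failed links. The function names (`reachable`, `broadcast_survives`, `ring`, `complete`) are hypothetical.

```python
from collections import deque
from itertools import combinations


def reachable(n, edges, source):
    """Nodes reached by flooding from source over the given undirected links."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def broadcast_survives(n, edges, k, source=0):
    """True if flooding reaches all n nodes no matter which k links have crashed."""
    edges = list(edges)
    for failed in combinations(edges, k):
        alive = [e for e in edges if e not in failed]
        if len(reachable(n, alive, source)) < n:
            return False
    return True


def ring(n):
    """Links of a ring on nodes 0..n-1 (edge connectivity 2)."""
    return [(i, (i + 1) % n) for i in range(n)]


def complete(n):
    """Links of a complete network on nodes 0..n-1 (edge connectivity n-1)."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    print(broadcast_survives(6, ring(6), 1))  # ring tolerates one link crash
    print(broadcast_survives(6, ring(6), 2))  # but not two
```

The exhaustive search is exponential in k and is meant only for experimenting with small examples such as those of Figure 7.3.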

7.1.4 Fault Tolerance, Agreement, and Common Knowledge

In most distributed computations there is a need for the entities to make a local but coordinated decision. This coordinated decision is called an agreement. For example, in the election problem, every entity must decide whether it is the leader or not. The decision is local but must satisfy some global constraint (only one entity must become leader); in other words, the entities must agree on which one is the leader. Each problem requiring an agreement has its own set of constraints defining the agreement. For example, in minimum finding, the constraint is that all and only the entities with the smallest input value must become minimum. Similarly, in ranking, when every entity has an initial data item, the constraint is that the value decided by each entity is precisely the rank of its data item in the overall distributed set.

When there are no faults, reaching these agreements is possible (as we have seen in the other chapters) and often straightforward. Unfortunately, the picture changes dramatically in presence of faults. Interestingly, the impact that faults have on problems requiring agreement for their solution has common traits, in spite of the differences of the agreement constraints. That is, some of the impact is the same for all these problems. For these reasons, we consider an abstract agreement problem where this common impact of faults on agreements is more evident.

In the p-Agreement Problem (Agree(p)), each entity x has an input value v(x) from some known set (usually {0, 1}) and must terminally decide upon a value d(x) from that set within a finite amount of time. Here, “terminally” means that once made, the decision cannot be modified. The problem is to ensure that at least p entities decide on the same value. Additional constraints, called nontriviality (or sometimes validity) constraints, usually exist on the value to be chosen; in particular, if all values are initially the same, the decision must be on that value. This nontriviality constraint rules out default-type solutions (e.g., “always choose 0”).

Depending on the value of p, we have different types of agreement problems. Of particular interest is the case of p = n/2 + 1, which is called strong majority. When p = n, we have the well-known Unanimity or Consensus Problem (Consensus) in which all entities must decide on the same value, that is,

∀x, y ∈ E, d(x) = d(y).    (7.1)

The consensus problem occurs in many different applications. For example, consider an aircraft where several sensors are used to decide if the moment has come to drop a cargo; it is possible that some sensors detect “yes” while others detect “not yet.” On the basis of these values, a decision must be made on whether or not the cargo is to be dropped now. A solution strategy for our example is to drop the cargo only if all sensors agree; another is to decide for a drop as soon as at least one of the sensors indicates so. Observe that the first solution corresponds to computing the AND of the sensors’ values; in the consensus problem this solution corresponds to each entity x setting d(x) = AND({v(y) : y ∈ E}). The second solution consists of determining the OR of those values, that is, d(x) = OR({v(y) : y ∈ E}). Notice that in both strategies, if the initial values are identical, each entity chooses that value. Another example is in distributed database systems, where each site (the entity) of the distributed database must decide whether to accept or drop a transaction; in this case, all sites will agree to accept the transaction only if no site rejects the transaction. The same solution strategies apply also in this case.

Summarizing, if there are no faults, consensus can be easily achieved (e.g., by computing the AND or the OR of the values). Lower forms of agreement, that is, when p < n, are even easier to resolve.

In presence of faults, the situation changes drastically and even the problem must be restated. In fact, if an entity is faulty, it might be unable to participate in the computation; even worse, its faulty behavior might be an active impediment for the computation. In other words, as faulty entities cannot be required to behave correctly, the agreement constraint can hold only for the nonfaulty entities. So, for example, a consensus problem we are interested in is

Entity-Fault-Tolerant Consensus (EFT-Consensus). Each nonfaulty entity x has an input value v(x) and must terminally decide upon a value d(x) within a finite amount of time. The constraints are

1. agreement: all nonfaulty entities decide on the same value;
2. nontriviality: if all values of the nonfaulty entities are initially the same, the decision must be on that value.

Similarly, we can define lower forms (i.e., when p < n) of agreement in presence of entity failures (EFT-Agree(p)). For simplicity (and without any loss of generality), we can consider the Boolean case, that is, when the values are all in {0, 1}. Possible solutions to this problem are, for example, computing the AND or the OR of the input values of the nonfaulty entities, or the value of an elected leader.
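
The fault-free AND and OR strategies, together with the agreement and nontriviality constraints, can be written down directly. The following Python sketch is centralized and assumes that every entity already knows all the input values (for instance, after a fault-free exchange); it is not a distributed protocol, and all function names are illustrative.

```python
from collections import Counter


def decide_and(inputs):
    """Every entity x sets d(x) to the AND of all input values v(y)."""
    d = int(all(inputs.values()))
    return {x: d for x in inputs}


def decide_or(inputs):
    """Every entity x sets d(x) to the OR of all input values v(y)."""
    d = int(any(inputs.values()))
    return {x: d for x in inputs}


def satisfies_agreement(decisions, p):
    """Agree(p): at least p entities have decided on the same value."""
    counts = Counter(decisions.values())
    return max(counts.values()) >= p


def satisfies_nontriviality(inputs, decisions):
    """If all inputs are equal, every decision must be that common value."""
    values = set(inputs.values())
    if len(values) != 1:
        return True
    (common,) = values
    return all(d == common for d in decisions.values())


if __name__ == "__main__":
    sensors = {"s1": 1, "s2": 0, "s3": 1}
    print(decide_and(sensors))  # drop only if all sensors say yes
    print(decide_or(sensors))   # drop as soon as one sensor says yes
```

With p equal to the number of entities, `satisfies_agreement` expresses the consensus condition (7.1).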
In other words, consensus (fault tolerant or not) can be solved by solving any of a variety of other problems (e.g., function evaluation, leader election, etc.). For this reason, the consensus problem is elementary: If it cannot be solved, then none of those other problems can be solved either.

Reaching agreement, and consensus in particular, is strictly connected with the problem of reaching common knowledge. Recall (from Section 1.8.1) that common knowledge is the highest form of knowledge achievable in a distributed computing environment. Its connection to consensus is immediate. In fact, any solution protocol P to the (fault-tolerant) consensus problem has the following property: As it leads all (nonfaulty) entities to decide on the same value, say d, then within finite time the value d becomes common knowledge among all the nonfaulty entities. Conversely, any (fault-tolerant) protocol Q that creates common knowledge among all the nonfaulty entities can be used to make them decide on the same value and thus achieve consensus.

IMPORTANT. This implies that common knowledge is as elementary as consensus: If one cannot be achieved, neither can the other.

7.2 THE CRUSHING IMPACT OF FAILURES

In this section we will examine the impact that faults have in distributed computing environments. As we will see, the consequences are devastating even when faults are limited in quantity and danger. We will establish these results assuming that the entities have distinct values (i.e., under restriction ID); this makes the bad news even worse.

7.2.1 Node Failures: Single-Fault Disaster

In this section we examine node failures. We consider the possibility that entities may fail during the computation and we ask under what conditions the nonfaulty entities may still carry out the task. Clearly, if all entities fail, no computation is possible; also, we have seen that some faults are more dangerous than others. We are interested in computations that can be performed, provided that at most a certain number f of entities fail, and those failures are of a certain type τ (i.e., danger). We will focus on achieving fault-tolerant consensus (problem EFT-Consensus described in Section 7.1.4), that is, we want all nonfailed entities to agree on the same value. As we have seen, this is an elementary problem.

A first and immediate limitation to the possibility of achieving consensus in presence of node failures is given by the topology of the network itself. In fact, by Lemma 7.1.2, we know that if the graph is not (k + 1)-node-connected, a broadcast to nonfaulty entities is impossible if k entities can crash. This means that

Lemma 7.2.1 If k ≥ 1 arbitrary entities can possibly crash, fault-tolerant consensus cannot be achieved if the network is not (k + 1)-node-connected.

This means, for example, that in a tree, if a node goes down, consensus among the others cannot be achieved. Summarizing, we are interested in achieving consensus, provided that at most a given number f of entities fail, those failures are of at most a certain type τ of danger, and the node connectivity c_node of the network is high enough. In other words, the problem is characterized by those three parameters, and we will denote it by EFT-Consensus(f, τ, c_node).

We will start with the simplest case:

f = 1, that is, at most one entity fails;
τ = crash, that is, if an entity fails, it will be in the most benign way;
c_node = n − 1, that is, the topology is not a problem as we are in the complete graph.

In other words, we are in a complete network (every entity is connected to every other entity); at most one entity will crash, leaving all the other entities connected to each other. What we want is that these other entities agree on the same value, that is, we want to solve problem EFT-Consensus(1, crash, n − 1). Unfortunately,

Theorem 7.2.1 (Single-Fault Disaster) EFT-Consensus(1, crash, n − 1) is unsolvable.

In other words, fault-tolerant consensus cannot be achieved even under the best of conditions. This really means that it is impossible to design fault-tolerant solutions for practically all important problems, as each could be used to achieve fault-tolerant consensus. Before proceeding further with the consequences of this result, also called the FLP Theorem (after the initials of those who first proved it), let us see why it is true. What we are going to do is to show that no protocol can solve this problem, that is, no protocol always correctly terminates within finite time if an entity can crash. We will prove it by contradiction. We assume that a correct solution protocol P indeed exists and then show that there is an execution of this protocol in which the entities fail to achieve consensus in finite time (even if no one fails at all). The proof is neither simple nor complex. It does require some precise terminology and uses some constructs that will be very useful in other situations also. We will need not only to describe the problem but also to define precisely the entire environment, including executions and events, among others. Some of this notation has already been introduced in Section 1.6.

Terminology

Let us start with the problem. Each entity x has an input register I_x, a write-once output register O_x, as well as unlimited internal storage. Initially, the input register of an entity contains a value in {0, 1}, and all the output registers are set to the same value b ∉ {0, 1}; once a value d_x ∈ {0, 1} is written in O_x, the content of that register is no longer modifiable. The goal is to have all nonfailed entities set, in finite time, their output registers to the same value d ∈ {0, 1}, subject to the nontriviality condition (i.e., if all input values are the same, then d must be that value). Let us consider next the status of the system and the events being generated during an execution of the solution protocol P.
An entity reacts to external events by executing the actions prescribed by the protocol P . Some actions can generate events that will occur later. Namely, when an entity x sends a message, it creates the future event of the arrival of that message; similarly, when an entity sets the alarm clock, it creates the future event of that alarm ringing. (Although an entity can reset its clock as part of its processing, we can assume, without loss of generality, that each alarm will always be allowed to ring at the time it was originally set for.) In other words, as described in Chapter 1, at any time t during the execution of a protocol, there is a set Future(t) of the events that have been generated so far but have not happened yet. Recall that initially, Future(0) contains only the set of the spontaneous events. To simplify the discussion, we assume that all entities are initiators (i.e., the set Future(0) contains an impulse for each entity), and we will treat both spontaneous events and the ringing of the alarm clocks as the same type of events and call them timeouts. We represent by (x, M) the event of x receiving message M, and by (x, ∅) the event of a timeout occurring at x.

As we want to describe what happens to the computation if an entity fails by crashing, we add special system events called crashes, one per entity, to the initial set of events Future(0), and denote by (x, crash) the crash of entity x. As we are interested only in executions where there is at most one crash, if event (x, crash) occurs at time t, then all other crash events will be removed from Future(t). Furthermore, if x crashes, all the messages sent to x but not arrived yet will no longer be processed; similarly, any timeout set by x but not occurred yet will no longer occur. In other words, if event (x, crash) occurs at time t, all events (arrivals and timeouts) involving x will be removed from all Future(t′) with t′ ≥ t.

Recall from Section 1.6 that the internal state of an entity is the value of all its registers and internal storage. Also recall that the configuration C(t) of the system at time t is a snapshot of the system at time t; it contains the internal state of each entity and the set Future(t) of the future events that have been generated so far. A configuration is nonfaulty if no crash event has occurred so far, faulty otherwise. Particular configurations are the initial configurations, in which all processes are in their initial state and Future is composed of all and only the spontaneous and crash events; by definition, all initial configurations are nonfaulty.

When an arrival or a timeout event ε occurs at x, x will act according to the protocol P: It will perform some local processing (thus changing its internal state); it might send some messages and set up its alarm clock; in other words, there will be a change in the configuration of the system (because event ε has been removed from Future, the internal state of x has changed, and some new events have been possibly added to Future). Clearly the configuration changes also if the event is a crash; notice that this event can occur only if no crash has occurred before.
Regardless of the nature of event ε, we will denote the new configuration as ε(C), where C was the configuration when the event occurred; we will say that ε is applicable to C and that the configuration ε(C) is reachable from C. We can extend this notation and say that a sequence of events ψ = ε_1 ε_2 . . . ε_k is applicable to configuration C if ε_k is applicable to C, and ε_{k−1} is applicable to ε_k(C), and ε_{k−2} is applicable to ε_{k−1}(ε_k(C)), . . ., and ε_1 is applicable to ε_2(. . . (ε_k(C)) . . .); we will say that the resulting configuration C′ = ε_1(ε_2(. . . (ε_k(C)) . . .)) = ψ(C) is reachable from C.

If an entity x sets the output register O_x to either 0 or 1, we say that x has decided on that value, and that state is called a decision state. The output register value cannot be changed after the entity has reached a decision state, that is, once x has made a decision, that decision cannot be altered. A configuration where all nonfailed entities have decided on the same value is called a decision configuration; depending on the value, we will distinguish between a 0-decision and a 1-decision configuration. Notice that once an entity makes a decision it cannot change it; hence, all configurations reachable from a 0-decision configuration are also 0-decision (similarly in the case of 1-decision).

Consider a configuration C and the set 𝒞(C) of all configurations reachable from C. If all decision configurations in this set are 0-decision (respectively, 1-decision), we say that C is 0-valent (respectively, 1-valent); in other words, in a v-valent configuration, whatever happens, the decision is going to be on v. If, instead, there are both 0-decision and 1-decision configurations in 𝒞(C), then we say that C is bivalent; in other words, in a bivalent configuration, which value is going to be chosen depends on the future events.

An important property of sequences of events is the following. Suppose that from some configuration C, the sequences of events ψ_1 and ψ_2 lead to configurations C_1 and C_2, respectively. If the entities affected by the events in ψ_1 are all different from those affected by the events in ψ_2, then ψ_2 can be applied to C_1 and ψ_1 to C_2, and both lead to the same configuration C_3 (see Figure 7.4). More precisely,

Lemma 7.2.2 Let ψ_1 and ψ_2 be sequences of events applicable to C such that

1. the sets of entities affected by the events in ψ_1 and ψ_2, respectively, are disjoint; and
2. at most one of ψ_1 and ψ_2 includes a crash event.

Then, both ψ_1 ψ_2 and ψ_2 ψ_1 are applicable to C. Furthermore, ψ_1(ψ_2(C)) = ψ_2(ψ_1(C)).

[Figure 7.4: from C, ψ_1 leads to C_1 and ψ_2 leads to C_2; applying ψ_2 to C_1 and ψ_1 to C_2 leads to the same configuration C_3.]

FIGURE 7.4: Commutativity of disjoint sequences of events.

If a configuration is reachable from some initial configuration, it will be called accessible; we are interested only in accessible configurations. Consider an accessible configuration C; a sequence of events applicable to C is deciding if it generates a decision configuration; it is admissible if all messages sent to nonfaulty entities are eventually received. Clearly, we are interested only in admissible sequences.

Proof of Impossibility

Let us now proceed with the proof of Theorem 7.2.1. By contradiction, assume that there is a protocol P that correctly solves the problem EFT-Consensus(1, crash, n − 1), that is, in every execution of P in a complete graph with at most one crash, within finite time all nonfailed entities decide on the same value (subject to the nontriviality condition). In other words, if we consider all the possible executions of P, every admissible sequence of events is deciding.

The proof involves three steps. We first prove that among the initial configurations, there is at least one that is bivalent (i.e., where, depending on the future events, both a 0 and a 1 decision are possible). We then prove that starting from a bivalent configuration, it is always possible to reach another bivalent configuration. Finally, using these two results, we show how to construct an infinite admissible sequence that is not deciding, contradicting the fact that all admissible sequences of events in the execution of P are deciding.

Lemma 7.2.3 There is a bivalent initial configuration.
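
The notion of valency used in this lemma can be made concrete on a toy example. The Python sketch below assumes a small, explicitly given transition graph (the dictionaries `SUCCESSORS` and `DECISION` are invented for illustration and do not describe any real protocol); it classifies a configuration by looking at the decision configurations reachable from it.

```python
def reachable_from(start, successors):
    """All configurations reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        config = stack.pop()
        for nxt in successors.get(config, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def valency(config, successors, decision):
    """Classify config by the values of the reachable decision configurations."""
    values = {decision[c] for c in reachable_from(config, successors)
              if c in decision}
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    if values == {0, 1}:
        return "bivalent"
    return "undetermined"  # no decision configuration is reachable


# Hypothetical toy system: each key is a configuration, each listed
# successor is the result of applying one applicable event.
SUCCESSORS = {"C": ["A", "B"], "A": ["A0"], "B": ["B0", "B1"]}
# Decision configurations and the value decided in each of them.
DECISION = {"A0": 0, "B0": 0, "B1": 1}

if __name__ == "__main__":
    for name in ("C", "A", "B"):
        print(name, valency(name, SUCCESSORS, DECISION))
```

In this toy system C and B are bivalent while A is 0-valent, so the event leading from C to A is one of those that fix the eventual decision.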

Proof. By contradiction, let all initial configurations be univalent, that is, either 0- or 1-valent. Because of the nontriviality condition, we know that there is at least one 0-valent initial configuration (the one where all input values are 0) and one 1-valent initial configuration (the one where all input values are 1). Let us call two initial configurations adjacent if they differ only in the initial value of a single entity. For any two initial configurations C′ and C″, it is always possible to find a chain of initial configurations, each adjacent to the next, starting with C′ and ending with C″. Hence, in a chain leading from a 0-valent to a 1-valent initial configuration, there exists a 0-valent initial configuration C^0 adjacent to a 1-valent initial configuration C^1. Let x be the entity in whose initial value they differ. Now consider an admissible deciding sequence ψ for C^0 in which the first event is (x, crash). Then, ψ can be applied also to C^1, and the corresponding configurations at each step of the sequence are identical except for the internal state of entity x. As the sequence is deciding, eventually the same decision value is reached in both cases. If it is 1-decision, then C^0 is bivalent; otherwise, C^1 is bivalent. In either case, the assumed nonexistence of a bivalent initial configuration is contradicted. ∎

Lemma 7.2.4 Let C be a nonfaulty bivalent configuration, and let ε = (x, m) be a noncrash event that is applicable to C. Let 𝒜 be the set of nonfaulty configurations reachable from C without applying ε, and let ℬ = ε(𝒜) = {ε(A) | A ∈ 𝒜 and ε is applicable to A} (see Figure 7.5). Then, ℬ contains a nonfaulty bivalent configuration.

Proof. First of all, observe that as ε is applicable to C, by definition of 𝒜 and because of the unpredictability of communication delays, ε is applicable to every A ∈ 𝒜. Let us now start the proof. By contradiction, assume that every configuration B ∈ ℬ is univalent. In this case, ℬ contains both 0-valent and 1-valent configurations (Exercise 7.10.4). Call two configurations neighbors if one is reachable from the other after a single event, and x-adjacent if they differ only in the internal state of entity x. By an easy induction (Exercise 7.10.5), there exist two x-adjacent (for some entity x) neighbors A_0, A_1 ∈ 𝒜 such that D_0 = ε(A_0) is 0-valent and D_1 = ε(A_1) is 1-valent. Without loss of generality, let A_1 = ε′(A_0) where ε′ = (y, m′).

Case I. If x ≠ y, then D_1 = ε′(D_0) by Lemma 7.2.2. This is impossible as any successor of a 0-valent configuration is also 0-valent (see Figure 7.6).

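[Figure 7.5: the configuration C and ε(C); the set 𝒜 of configurations A_1, A_2, ..., A_i reachable from C without applying ε; and the set ℬ of the corresponding configurations ε(A_1), ε(A_2), ..., ε(A_i).]

FIGURE 7.5: The situation of Lemma 7.2.4.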

[Figure 7.6: the configurations A_0 and A_1, with D_0 = ε(A_0) and D_1 = ε(A_1).]

FIGURE 7.6: The situation in Case 1 of Lemma 7.2.4.

Case II. If x = y, then consider the two configurations E_0 = c_x(D_0) and E_1 = c_x(D_1), where c_x = (x, crash); as both ε and ε′ are noncrash events involving x, and the occurrence of c_x removes from Future all the future events involving x, it follows that E_0 and E_1 are x-adjacent. Therefore, if we apply to both the same sequence of events not involving x, they will remain x-adjacent. As P is correct, there must be a finite sequence ψ of (noncrash) events not involving x that, starting from E_0, reaches a decision configuration; as E_0 is 0-valent, ψ(E_0) is 0-decision (see Figure 7.7). As the events in ψ are noncrash and do not involve x, they are applicable also to E_1, and ψ(E_0) and ψ(E_1) are x-adjacent. This means that all entities other than x have the same state in ψ(E_0) and in ψ(E_1); hence, also ψ(E_1) is 0-decision. As E_1 is 1-valent, ψ(E_1) is also 1-valent, a contradiction. So ℬ contains a bivalent configuration; as, by definition, ℬ is only composed of nonfaulty configurations, the lemma follows. ∎

[Figure 7.7: diagram relating A_0, A_1 ∈ 𝒜, D_0, D_1 ∈ ℬ, E_0 = c_x(D_0), E_1 = c_x(D_1), ψ(E_0), and ψ(E_1).]

FIGURE 7.7: The situation in Case 2 of Lemma 7.2.4. The valency of the configuration, if known, is in square brackets.

Any deciding sequence ψ of events from a bivalent initial configuration goes to a univalent configuration, so there must be some single event in that sequence that generates a univalent configuration from a bivalent one; it is such an event that determines the eventual decision value. We now show that, using Lemmas 7.2.3 and 7.2.4 as tools, it is always possible to find a fault-free execution that avoids such events, creating a fault-free admissible but nondeciding sequence. We ensure that the sequence is admissible and nondeciding in the following way.

1. We maintain a queue Q of entities, initially in an arbitrary order.
2. We remove from the set of initial events all the crash events, that is, we consider only fault-free executions.
3. We maintain the future events sorted (in increasing order) according to the time they were originated.
4. We construct the sequence in stages as follows:
   (a) The execution begins in a bivalent initial configuration C_b whose existence is assured by Lemma 7.2.3.
   (b) Starting stage i from a bivalent configuration C, say at time t, consider the first entity x in the queue that has an event in Future(t). Let ε be the first event for x in Future(t).
   (c) By Lemma 7.2.4, there is a bivalent configuration C′ reachable from C by a sequence of events, say ψ, in which ε is the last event applied. The sequence for stage i is precisely this sequence of events ψ.
   (d) We execute the constructed sequence of events, ending in a bivalent configuration.
   (e) We move x and all preceding entities to the back of the queue and start the next stage.
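
The queue discipline of steps 1, 3, and 4(b)-(e) can be sketched in isolation. The Python fragment below is only an illustration of the scheduling: it does not model the intermediate events supplied by Lemma 7.2.4, and the function name `run_stages` and the representation of pending events as one first-in-first-out list per entity are assumptions made here.

```python
from collections import deque


def run_stages(entities, pending, stages):
    """Simulate the queue discipline for a given number of stages.

    entities: initial order of the queue Q.
    pending:  maps each entity to a deque of its future events, oldest first.
    Returns the list of (entity, event) pairs served, one per stage.
    """
    queue = deque(entities)
    served = []
    for _ in range(stages):
        # Step (b): first entity in the queue that has a pending event.
        idx = next((i for i, x in enumerate(queue) if pending[x]), None)
        if idx is None:
            break  # no future events left
        x = queue[idx]
        served.append((x, pending[x].popleft()))  # oldest event of x
        # Step (e): move x and all preceding entities to the back.
        queue.rotate(-(idx + 1))
    return served


if __name__ == "__main__":
    pending = {"a": deque([1, 2]), "b": deque(), "c": deque([3, 4])}
    print(run_stages(["a", "b", "c"], pending, stages=10))
```

Because a served entity goes to the back of the queue, no entity with pending events can be bypassed forever, which is the property used to argue that the constructed sequence is admissible.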

In any infinite sequence of such stages every entity comes to the front of the queue infinitely many times and receives every message sent to it. The sequence of events so constructed is therefore admissible. As each stage starts and ends in a bivalent configuration, a decision is never reached. The sequence of events so constructed is therefore nondeciding. Summarizing, we have shown that there is an execution in which protocol P never reaches a decision, even if no entity crashes. It follows that P is not a correct solution to our consensus problem.

7.2.2 Consequences of the Single-Fault Disaster

The Single-Fault Disaster result of Theorem 7.2.1 dashes any hope for the design of fault-tolerant distributed solution protocols for nontrivial problems and tasks. Because the consensus problem is an elementary one, the solution of almost every nontrivial distributed problem can be used to solve it; but as consensus cannot be solved even if just a single entity may crash, none of those other problems can be solved either if there is the possibility of failures. The negative impact of this fact must not be underestimated; its main consequence is that it is impossible to design fault-tolerant communication software.

This means that to have fault tolerance, the distributed computing environment must have additional properties. In other words, while fault tolerance is not possible in general (because of Theorem 7.2.1), some degree of fault tolerance might be achieved in more restricted environments. To understand which properties (and thus restrictions) would suffice, we need to examine the proof of Theorem 7.2.1 and to understand what are the particular conditions inside a general distributed computing environment that make it work. Then, if we disable one of these conditions (by adding the appropriate restriction), we might be able to design a fault-tolerant solution.
The reason why Theorem 7.2.1 holds is that, as communication delays are finite but unpredictable, it is impossible to distinguish between a link experiencing very long communication delays and a failed link. (Recall that communication delays include both transmission and processing delays.) In our case, the crash failure of an entity is equivalent to the simultaneous failure of all its links. So, if entity x is waiting for a reply from y and has not received one so far, it cannot decide whether y has crashed or not. It is this “ambiguity” that leads, in the proof, to the construction of an admissible but nondeciding infinite sequence of events. This means that to disable that proof we need to ensure that this “ambiguity” cannot occur. Let us see how this can be achieved.

First of all, observe that if communication delays were bounded and clocks synchronized, then no ambiguity would occur: as any message would take at most Δ time, if entity x sends a message to y and does not receive the expected reply from y within 2Δ time, it can correctly decide that y has crashed. This means that, in synchronous systems, the proof of Theorem 7.2.1 does not hold; in other words, the restrictions Bounded Delays and Synchronized Clocks together disable that proof.

Next, observe that the reason why the ambiguity is removed in a synchronous environment is that the entities can use timeouts to reliably detect whether a crash failure has occurred. Indeed, the availability of any reliable fault detector would remove any ambiguity and thus disable that proof of Theorem 7.2.1. In other words, either restriction Link-Failure Detection or restriction Node-Failure Detection would disable that proof even if communication delays are unbounded.

Observing the proof, another point we can make is that it assumes that all initial bivalent configurations are nonfaulty, that is, the fault has not occurred yet. This is necessary in order to give the “adversary” the power to make an entity crash when most appropriate for the proof. (Simple exercise question: Where in the proof does the adversary exercise this power?) If the crash has occurred before the start of the execution, the adversary loses this power. It is actually sufficient that the faulty entity crashes before it sends any message, and the proof no longer holds. This means that it might still be possible to tolerate some crashes if they have already occurred, that is, if they occur before the faulty entities send messages. In other words, the restriction Partial Reliability, stating that no faults will occur during the execution of the protocol, would disable the proof even if communication delays are unbounded and there are no reliable fault detectors.

Notice that disabling the proof we used for Theorem 7.2.1 does not imply that the theorem does not hold; indeed, a different proof could still work. Fortunately, in the restricted environments we have just indicated, the entire Theorem 7.2.1 is no longer valid, as we will see later.
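The timeout argument for bounded delays can be illustrated with a minimal sketch. It assumes a known delay bound (the parameter `delta`, standing for Δ) and a common clock; the function name `classify_peer` and its three return labels are hypothetical, introduced here only for illustration.

```python
from typing import Optional


def classify_peer(sent_at: float, reply_at: Optional[float],
                  now: float, delta: float) -> str:
    """Classify entity y from the point of view of entity x.

    Assumptions (not from the text): x sent its message at time `sent_at`,
    `reply_at` is the arrival time of y's reply (None if no reply was sent),
    and every message takes at most `delta` time units, processing included.
    """
    deadline = sent_at + 2 * delta  # request plus reply take at most 2 * delta
    if reply_at is not None and reply_at <= now:
        return "replied"            # y had not crashed when it answered
    if now > deadline:
        return "crashed"            # bounded delays make this conclusion safe
    return "unknown"                # too early: y may just be slow


print(classify_peer(0.0, 1.5, 2.0, 1.0))   # replied
print(classify_peer(0.0, None, 1.9, 1.0))  # unknown
print(classify_peer(0.0, None, 2.1, 1.0))  # crashed
```

Without a bound on delays there is no finite `deadline`, so the "unknown" answer never turns into "crashed"; this is the ambiguity exploited by the proof.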
Finally, observe that the unsolvability stated by Theorem 7.2.1 means that there is no deterministic solution protocol. It does not, however, rule out randomized solutions, that is, protocols that use randomization (e.g., flip of a coin) inside the actions. The main drawback of randomized protocols is that they do not offer any certainty: Either termination is not guaranteed (except with high probability) or correctness is not guaranteed (except with high probability).

Summarizing, the Single-Failure Disaster result imposes a dramatic limitation on the design of fault-tolerant protocols. The only way around (possibly) is by substantially restricting the environment: investing in the software and hardware necessary to make the system fully synchronous; constructing reliable fault detectors (unfortunately, none exists so far except in fully synchronous systems); or, in the case of crash faults only, ensuring somehow that all the faults occur before we start, that is, partial reliability. Alternatively, we can give up certainty on the outcome and use randomization.

7.3 LOCALIZED ENTITY FAILURES: USING SYNCHRONY

In a fully synchronous environment, the proof of the Single-Failure Disaster theorem does not hold. Indeed, as we will see, synchronicity allows a high degree of fault tolerance.


Recall from Chapter 6 that a fully synchronous system is defined by two restrictions: Bounded Delays and Synchronized Clocks. We can actually replace the first restriction with the Unitary Delays one, without any loss of generality. These restrictions together are denoted by Synch.

We consider again the fault-tolerant consensus problem EFT-Consensus (introduced in Section 7.1.4) in the complete graph in case of component failures; more specifically, we concentrate on entity failures, that is, the faults are localized (i.e., restricted) to a set of entities (even though we do not know beforehand which they are). The problem asks for all the nonfaulty entities, each starting with an initial value v(x), to terminally decide on the same value in finite time, subject to the nontriviality condition: If all initial values are the same, the decision must be on that value.

We will see that if the environment is fully synchronous, under some additional restrictions, the problem can be solved even when almost one third of the entities are Byzantine. In the case of crash failures, we can actually solve the problem tolerating any number of failures.

7.3.1 Synchronous Consensus with Crash Failures

In a synchronous system in which the faults are just crashes of entities, under some restrictions, consensus (among the nonfailed entities) can be reached regardless of the number f of entities that may crash. The restrictions considered here are

Additional Assumptions
1. Connectivity, Bidirectional Links;
2. Synch;
3. the network is a complete graph;
4. all entities start simultaneously;
5. the only type of failure is entity crash.
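The requirements of the consensus problem stated at the beginning of this section can be written compactly as follows. This is only a restatement under notational assumptions: S denotes the set of nonfaulty entities and O_x the value on which entity x terminally decides.

```latex
\begin{align*}
\textbf{Agreement:}\quad
  & \forall x, y \in S:\; O_x = O_y, \\
\textbf{Nontriviality:}\quad
  & \forall b:\; \bigl(\forall x:\; v(x) = b\bigr)
    \;\Rightarrow\; \bigl(\forall x \in S:\; O_x = b\bigr), \\
\textbf{Termination:}\quad
  & \text{every } x \in S \text{ sets } O_x \text{ within finite time and never changes it.}
\end{align*}
```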

Note that an entity can crash while performing an action, that is, it may crash after sending some but not all the messages requested by the action.

Solution Protocols

In this environment there are several protocols that achieve consensus tolerating up to f ≤ n − 1 crashes. Almost all of them adopt the same simple mechanism, Tell All(T), where T is an input parameter. The basic idea behind the mechanism is to collect at each nonfaulty entity enough information so that all nonfaulty entities are able to make the same decision by a given time.

Mechanism Tell All(T)

At each time step t ≤ T, every nonfailed entity x sends to all its neighbors a message containing a “report” on everything it knows and waits for a similar message from each of them.


TellAll-Crash.
begin
   for t = 0, . . . , f do
      compute rep(x, t);
      send rep(x, t) to N(x);
   endfor
   O_x := rep(x, f + 1);
end

FIGURE 7.8: Protocol TellAll-Crash.

If x has not received a message from neighbor y by time t + 1, it knows that y has crashed; if it receives a message from y, it will know a “report” on what y knew at time t (note that in case of Byzantine faults, this “report” could be false). For the appropriate choice of T and with the appropriate information sent in the “report,” this mechanism enables the nonfaulty entities to reach consensus. The actual value of T and the nature of the report depend on the types and number of faults the protocol is supposed to tolerate.

Let us now see a fairly simple consensus protocol, called TellAll-Crash and based on this mechanism, that tolerates up to f ≤ n − 1 crashes. The algorithm is just mechanism Tell All where T = f and the “report” consists of the AND function of all the values seen so far. More precisely,

\[
\mathrm{rep}(x,t) =
\begin{cases}
v(x) & \text{if } t = 0,\\
\mathrm{AND}\bigl(\mathrm{rep}(x,t-1),\, M(x_1,t),\, \ldots,\, M(x_{n-1},t)\bigr) & \text{otherwise,}
\end{cases}
\tag{7.2}
\]

where x_1, . . . , x_{n−1} are the neighbors of x and M(x_i, t) denotes the message received by x from x_i at time t, if any; otherwise M(x_i, t) = 1. The protocol is shown in Figure 7.8.

To see how and why protocol TellAll-Crash works, let us make some observations. Let F be the set of entities that crashed before or during the execution of the protocol, and S the others. Clearly, |F| ≤ f and |F| + |S| = n.
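The behavior of the protocol can be explored with a small round-based simulation. This is a sketch under assumptions, not the protocol's reference implementation: entities are numbered 0, . . . , n − 1, values are bits, and the hypothetical parameter `crashes` maps a faulty entity to the step at which it crashes together with the set of neighbors it still reaches in that step (modeling a crash in the middle of a send). A missing message counts as 1, the neutral element of AND, so it can simply be skipped.

```python
from typing import Dict, List, Set, Tuple


def tell_all_crash(values: List[int], f: int,
                   crashes: Dict[int, Tuple[int, Set[int]]]) -> Dict[int, int]:
    """Simulate TellAll-Crash on a complete graph with unit delays.

    values[x] is the initial bit v(x); crashes[x] = (step, reached) is an
    illustrative encoding: x crashes while sending at `step` and only the
    entities in `reached` get that last message.
    Returns the decision rep(x, f + 1) of every entity that never crashed.
    """
    n = len(values)
    rep = list(values)                       # rep(x, 0) = v(x)
    for t in range(f + 1):                   # sending steps t = 0, ..., f
        inbox: List[List[int]] = [[] for _ in range(n)]
        for x in range(n):
            crash = crashes.get(x)
            if crash is not None and crash[0] < t:
                continue                     # x crashed at an earlier step
            if crash is not None and crash[0] == t:
                targets = crash[1] - {x}     # partial send, then silence
            else:
                targets = set(range(n)) - {x}
            for y in targets:
                inbox[y].append(rep[x])
        # rep(x, t + 1): AND of the previous report and all bits received;
        # on bits, AND coincides with min.
        rep = [min([rep[x]] + inbox[x]) for x in range(n)]
    return {x: rep[x] for x in range(n) if x not in crashes}


# Entity 0 holds the only 0 and reaches just entity 1 before crashing;
# entity 1 then reaches just entity 2 before crashing.
crashes = {0: (0, {1}), 1: (1, {2})}
print(tell_all_crash([0, 1, 1, 1], 2, crashes))  # {2: 0, 3: 0}
```

In this run two entities crash and f = 2, so the loop performs three sending steps and the surviving entities agree. Calling the same function with f = 1 on the same crash pattern (which violates |F| ≤ f) returns {2: 0, 3: 1}, a disagreement, which suggests why the number of steps is tied to f.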

Property 7.3.1 If all entities start with initial value 1, all entities in S will decide on 1.

Property 7.3.2 If an entity x ∈ S has or receives a 0 at time t ≤ f, then all entities in S will receive a 0 at time t + 1.

Property 7.3.3 If an entity x ∈ S has or receives a 0 during the execution of the protocol, it will decide on 0.


These three facts imply that all nonfailed entities will decide on 0 if at least one of them has initial value 0 and will decide on 1 if all entities have initially 1. The only case left to consider is when all entities in S have initially 1 but some entities in F have initially 0. If any of the latter does not crash in the first step, by time t = 1 all entities in S will receive 0 and thus decide on 0 at time f + 1. This means that the nonfailed entities at time t = f + 1 will all decide on 0 unless
1. up to time f they have seen and received only 1; and
2. at time f + 1 some (but not all) of them receive 0.
In fact, in such a case, as the execution terminates at time f + 1, there is no time for the nonfailed entities that have seen 0 to tell the others. Can this situation occur in reality?

For this situation to occur, the 0 must have been sent at time f by some entity y_f; note that this entity must be in F and crash in this step, sending the 0 only to some of its neighbors (otherwise all entities in S, and not just some, would have received 0 at time f + 1). Also, y_f must have initially had 1 and received 0 only at time f (otherwise it would have sent it before and, as it had not crashed yet, everybody would have received it). Let y_{f−1} be one of the entities that sent the 0 received by y_f at time f; note that this entity must be in F and must have crashed in that step, sending the 0 only to y_f and other entities not in S (otherwise all entities in S would receive 0 by time f + 1). Also, y_{f−1} must have initially had 1 and received