Data Management: Databases and Organizations Richard Watson

  • Published on

  • View

  • Download


Data Management: Databases and Organizations Richard Watson. Summary of Chapter 7 and Basic Structures prepared by Kirk Scott. Data Modeling and SQL. Chapter 7. Data Modeling Reference: Basic Structures. Chapter 7. Data Modeling. - PowerPoint PPT Presentation


Slide 1Data Management: Databases and OrganizationsRichard WatsonSummary of Chapter 7 and Basic Structures prepared by Kirk Scott1Data Modeling and SQLChapter 7. Data ModelingReference: Basic Structures2Chapter 7. Data ModelingThe building blocks of data modeling should be familiar to you:EntitiesAttributesRelationshipsIdentifiers (keys)The next five overheads taken from chapter 7 review the ER notation for these things345678A model is the starting point for creating a databaseNo table need be created before the model is completeQuality of the data model is essentialThe model should be well formed: It should follow the basic rules for entities, attributes, relationships, and keysThe following overhead summarizes the characteristics of a well formed model910A quality data model should be high-fidelityThis means that it has to accurately and completely model the situation in the problem domainA model which is well formed but does not model the problem domain is useless from a practical point of view11The phrase quality improvement in the context of data models means this:It is unrealistic to assume that a good data model can be created on the first tryA data model will evolve as technical mistakes are caughtMore importantly, it will evolve as a result of interaction with users as the problem domain and requirements are more completely understood12The Stock ExampleA simple data model for nations and stocks is given on the next overheadSuperficially, it seems OKIt could be verbally summarized as Nations have stocks1314The book now introduces the following additional textual informationStocks are listed on stock exchanges (a new entity)A nation may have >1 stock exchangeA given stock may be listed on >1 exchange, but it has 1 home exchange15Stocks can be listed on the exchanges of >1 countryNotice that the abstraction of a listing is repeated in this descriptionThat suggests that a listing itself will be an entityThe next overhead shows a revised model that takes into account the new assumptions1617The Geography ExampleNext the book gives a simple example thats supposed to model the relationships between nations, administrative units (states), and citiesSee the next overhead for a straightforward model of this1819The book next observes that exceptions are the bane of a good modelIf you presume to model these globally, then your model should accommodate all possible situationsThe book asks, How many errors can you find in the initial data model?See the table on the following overheads for answers202122The next overhead shows a nations, administrative units, cities data model that has been revised to take into account these exceptional cases/errors in the initial modelThis revised model may seem needlessly complexHowever, the complexity is not needlessThis is an accurate model of the situation that covers all casesThe initial model was insufficiently complexIt was wrong2324The Women, Men, Marriage, and People ExamplesThis topic was brushed on all the way back in unit oneCapturing the relationships among people is a very common problem that leads to some familiar challenges and design/model choicesOn the following overhead is an ER diagram of the relationship between married men and women2526The foregoing model is obviously hilariously limited in the kind of relationship it can captureIn addition, the book points out the following characteristics of the model which might indicate that a different model would be better1. The labeling of the model indicates that this is a marriage, but there is nothing in the fields that spells this outIn particular, you might think that there would be a date field, a marriage license number, something among the fields that was specific to marriage272. The Man and Woman tables have the same set of attributes, different only in their being name manX or womanXThis might suggest that we are dealing with one entity type, person, rather than two distinct entity types, man and woman283. The last observation concerns the fields manoname and womanonameThese stand for other nameAs the model stands, a person can only have one other nameAlternatively, if the other name field is text, it might be filled with multiple valuesnot an ideal solutionA complete treatment of people and other names might introduce another table so that there could be a one-to-many relationship between people and their various names29The book doesnt solve all of these problems, but it does come up with a second modelIf there were two types to begin with and you combine into one, you frequently get a new field in the resultA person now has a gender fieldAlso, the labeling of the relationship could be made more genericSee the following overhead3031Next the book tackles the topic of multiple marriagesIf youre dealing with a Person table, then the table is in a many-to-many relationship with itselfTo distinguish between multiple marriages, potentially between the same partners, beginning and ending date fields can be added to the table in the middleSee the next overhead for the third version of the model3233In the long run, some sort of arbitrary numbering scheme might be desirableA marriage license number might work, but the book points out that legally speaking it might also be desirable to record common law marriagesNotice in general that a lot of data integrity questions start to arise with a model like thisSee the next overhead for the fourth version of the model3435Next the book considers adding children to the modelChildren are modeled as the result of marriageOf course, this is not always the caseAs long as the marriageno field in the person table can be null, the model accommodates thatStill, it doesnt allow you to record who a persons parents are if the person wasnt the result of marriageSee the next overhead for the fifth version of the model3637The person model could be developed even furtherThis model barely scratches the surface of the variety of human relationshipsIt is already moderately complex but could become more complex38A model is complete when it contains everything needed in practice for a given problemThe model is unsuitable if it isnt complex enoughIt is also unsuitable if it contains detail that isnt needed39The Book ExampleThe book entitles this Whens a book not a book?In other words, the example is an invitation to clarify what you mean when you refer to entities in a designAre you referring to individual objects?Are you referring to kinds of objects?What elements of a design make it possible to distinguish between these meanings?A simplistic initial design is given on the next overhead4041The book observes that a library may have more than one copy of a bookYou might be tempted to model this by adding a copy number to the book recordThe problem with that solution is that the basic book information would be repeated for every copyThe solution is to treat a book as an abstract entity and a copy as a separate, concrete entitySuch a design is shown on the next overhead4243You may have noticed that although the ISBN should be a unique identifier for a book (not a copy) it is not used as a primary key in these designsThe problem is that books before a certain date did not have ISBNsAlso, you may have hand-crafted modern books that werent commercially published and dont have ISBNs44The Employment History ExampleThis example starts out simply enoughA given company has divisionsThe divisions have departmentsDepartments have employeesThis is shown in the ER diagram on the next overhead4546Next, the author observes that over time a given employee may hold different positionsThese positions may be in different departmentsLike marriages, the distinguishing features of positions may include a beginning and ending dateThis is shown in the ER diagram on the next overhead4748Next the author introduces the concept of a payslip into the record-keeping that the model includesIts not fully fleshed out in the next example, but when you look at the diagram you may have an inkling that the treatment of payslips is reminiscent of the treatment of line itemsThis is shown in the ER diagram on the next overhead4950The final design treats payslips exactly like the line item exampleA payslip is like a bill of salePay slip text is like an itemThe table in the middle, PaySlipLine, is like LineItemThe pk of PaySlipLine is the concatenation of the pk of Payslip embedded as a fk, plus a pay slip number (payslipno)The pk of PslText is embedded separately as a fkThis is shown in the ER diagram on the next overhead5152The Aircraft Leasing ExampleIn the previous set of overheads the first design containing a cycle cropped upThis example also contains a cycleThere are three base tables and three tables in the middleEach of the base tables is in a many-to-many relationship with each otherOverall, the tables are in a many-to-many-to-may relationshipThis is shown in the ER diagram on the next overhead5354How to properly model a situation becomes an important question in the next chapter, on normalizationIn the meantime, the following observation can be made:An aircraft lease is an abstract entity that seems to be part of the business problemHowever, it doesnt appear in the designThis isnt just a problem in a theoretical sense55First of all its clear that in order to get complete information about a lease from this design a 6-way join would be neededThats inconvenientAlso, leases themselves may have attributes like starting and ending datesThere is no place to record themAn improved, star-like design for the problem is shown on the next overhead5657The Project Management ExampleThis example addresses the question of where something could or should be modeledIt impinges on the question of how the model has to be changed to capture a more detailed business situationThe first model is given on the next overheadIt should be relatively self-explanatory5859Now consider the altered model on the following overheadThe planned hours attribute has been moved from the Activity entity to the Daily Work entityThis small change in location of a field has a clear and logical outcomeThe planning of project hours is done on a daily basis, not an activity basis6061Cardinality and ModalityCardinality refers to the count of the number of instances of entities in a relationshipModality is a fancy way of saying that there can be 0 entities in a relationshipIn other words, one end of a relationship is optionalThis condition obtains, for example, when a pk in one table has not fk entries in anotherIt also obtains when a fk value is null62The book gives the table shown on the next overhead summarizing cardinality and modality6364The author now enhances the notation for ER diagramsUnlike UML, it is not customary to mark actual digits at the ends of crows feetInstead, a short vertical bar marks the end of a relationship where an instance of an entity is mandatoryAn o marks the end of a relationship where an instance of an entity is optional65The Nation and Stock ExampleThe following diagram of the 1-m relationship between nations and stocks illustrates this new notationNation has a barStock has an oEvery stock has to have a nationThe nation code field in the Stock table cant be nullA nation doesnt have to have a stockThere can be nation code values in the Nation table where no such nation code appears in the Stock table6667The Sale, Item, and Lineitem ExampleThe following diagram of the m-n relationship between sales and items also illustrates this new notationA sale has to have at least one line itemA line item has to belong to a saleA line item has to have an itemAn item doesnt have to be part of a sale68These verbal statements can be translated into null/not null and existence/non-existence requirements for fields and rows in tablesThe new thing illustrated by this example is that if you have a row for a sale in the Sale table, the ER diagram now states that it has to have a corresponding record in the Lineitem tableThis is not something that can be enforced by the database using referential integrity, for exampleIt is a new kind of data integrity constraint6970The Department and Employee ExampleThe following diagram of the 1-m and 1-1 relationships between departments and employees also illustrates this new notationAn employee has to have a departmentA department doesnt have to have employeesA department has to have a bossAn employee doesnt have to be the boss of a department71It is worth paying close attention to the 1-1 relationshipIt looks a little odd to have a line with no crows foot with a bar at one end and an o at the otherRecall that in order to reduce the number of nulls, the 1-1 relationship was captured by embedding the pk of Employee as a fk in DepartmentThe notation means that the fk cant be nullIt also means that not every pk of Employee has to appear as a fk value7273Recall that when presented earlier, the Employee-Department diagram grew to include the (recursive) relationship telling which employee was which other employees bossIn the following diagram, this line has os at both endsThis means its possible to have employees who are not bossesIt also means that the embedded fk field can be nullIn other words, there can be employees who dont have bosses7475The Monarch ExampleThe monarch succession relationship can also be marked for modalityThe first monarch would have no predecessorThe current monarch would have no successor (yet)Both ends of the relationship are optionalThis is shown in the ER diagram on the following overhead7677The Product Assembly ExampleModality can also be added to the product-assembly exampleIf there is an assembly entry, it has to have a super-productLikewise, if there is an assembly entry, it has to have a sub-productOn the other hand, there can be products that are neither super-products nor sub-products78It is interesting to note that in this situation the vertical bars repeat information that can be inferred from the rest of the diagramThe + signs on the crows feet mean that the embedded foreign keys are also primary keysAs primary keys, they cant be nullAs foreign keys, referential integrity states that their values have to occur in the corresponding primary key tableTherefore, the corresponding super-product or sub-product entry has to existIt is mandatory7980Entity TypesThe author categorizes entities into the following types:IndependentWeak or dependentAssociativeAggregateSubordinate81Independent entitiesThe following ER diagram shows two independent entitiesInstances of each can exist regardless of the existence of matching instances of the otherAlthough a pk is embedded as a fk, the pk may have no matches and the fk may be nullIndependent entities are usually the easiest base tables to recognize in a problem domain8283Weak or Dependent EntitiesA weak entity is one where the pk of another table is embedded as a fk in itAnd the fk is part of the primary key of the dependent tableBecause the pk of the weak entity cant be null, in its role as a fk, that field has to have a corresponding pk value in the other tableIn other words, an instance of a dependent entity simply cant exist without the existence of a matching instance in the other tableIn the following ER diagram, cities cant exist without their corresponding regions8485Associative EntitiesAssociative entities have already been explainedThis is an alternative name for the table in the middleThe table in the middle may or may not have a concatenated keyIt may or may not have attributes of its own86If the table in the middle does have attributes, in practice they are frequently date or time attributesThis makes it possible to keep track of multiple pairings of the same base entities over timeThe following ER diagram gives Position as the table in the middle8788Aggregate EntitiesThe book doesnt have a diagram for this conceptIt explains it verballyCustomers and suppliers are two different entities which might both have addressesAddress information could be broken out of both89Once it is broken out, there is no reason to have two different kinds of addressAnd address is an address, and both the Customer and Supplier tables could be in a relationship with an address (address line) tableThe reason there is no diagram is that once model analysis is complete, the aggregate table simply becomes another base table90Subordinate EntitiesSubordinate entities are entities are entities which are a more detailed kind of some other entityIn other words, the main entity holds attributes common to all different kindsThe subordinate entity hold attributes for a specific kindYou know you have a subordinate entity when the pk of one table is the pk of the otherThe following ER diagram illustrates the idea with animalsNotice that the relationships are one-to-one with a + sign9192Generalization and AggregationFirst, keep in mind that the use of the term aggregation here is different from its use in the phrase aggregate entityAlso note that the author is now introducing object-oriented ideasThis makes it possible to compare ER notation with UML notation93The following UML diagram captures the relationship between animals, sheep, and horse that was illustrated in the previous ER diagramAnimal is a generalization of the other two kinds of animalsTogether, the different kinds of animals form a hierarchy of the type is-a or is-a-kind-of which should be familiar from the object-oriented world9495AggregationAggregation and composition are usually treated together in object-orientationAggregation captures a has-a or containment relationshipIn UML it is symbolized by a diamondThe diamond is less intuitive than the crows foot, but they are roughly equivalentThis is illustrated on the next overhead9697Aggregation and Composition; One-to-Many and Many-to-Many RelationshipsIn UML, the term aggregation is usually described as a simple has-a relationship and is symbolized with a white diamondComposition is usually described by a phrase like, the parts cant exist without the whole and is symbolized by a black diamondThese concepts translate at least in part into relational database concepts and ER diagrams98The translation between object-oriented and relational isnt perfect thoughObject-oriented code can have referencesAll relationships in the relational model are captured by the values of fieldsThe diagrams on the next overhead show the relationship between classes and studentsThey will be followed by commentary99100In the UML diagram a white diamond is usedStudents can exist without enrolling in any classesMore importantly, in the UML diagram there is no Enrollment classA many-to-many relationship can be captured using references alone101In the ER diagram there is an Enrollment class to capture the many-to-many relationshipThis is a classic table in the middle with a concatenated primary key, indicated by the + signsAs such, enrollment records are dependent102Enrollment records cannot exist without corresponding student and class recordsIf there were an Enrollment class in the O-O model, it would be an example of composition, not aggregationFor further explanation, see the next example103Next the book illustrates the relationship between students an aptitude testsThe key to the diagrams is the relationship label takenAptitude tests themselves can exist without students104However, the class/entity labeled Aptitude test actually means a specific aptitude test scoreThe diagram is given on the next overheadMore explanations follow it105106The point is that specific aptitude test scores cant exist without the student who took the test and got that scoreIn the UML diagram the relationship is shown with a black diamondThe part cant exist without the wholeIn the ER diagram there is a crows foot with a + signThis means that an aptitude test record cant exist without a corresponding student record107Data Modeling HintsThe book next addresses these subpoints:The rise and fall of a data modelIdentifierPosition and orderAttributes and consistencyNames and addressesSingle instance entities108Picking wordsSynonymsHomonymsException huntingRelationship labelingKeeping the data model in shapeUsed entities109The rise and fall of a data modelThe book points out that a model will both grow and shrink as it developsDiscovering new entities will cause it to growTrying to handle greater specificity will cause it to growGeneralization happens when you recognize useful commonalityThis will cause a model to shrink in a useful wayConsider the diagrams on the following overheads110Growth, specificity111Generalization, shrinkage112IdentifierThe basic rule, except in cases where a simple concatenated key works:If there is no obvious identifier (pk), simply make up an arbitrary oneConsecutive numbering by entry order would be a simple choiceNotice that packages like Access have features like this113There is an irony to such helpful featuresThey are most likely to be used by people who dont even know what a pk is, and they will end up making a confusing messFor more informed users, the feature isnt really necessary, and theyre more likely to want full control over the values entered anyway114Position and OrderThe concepts of position and order apply to both the ER diagram of the model and the contents of tablesThe general rule for presenting a model is to be organizedThe most important base entities might appear in the center, at the top, starting at the leftsomewhere, anywhere where they arent hidden as afterthoughtsAlso, arranging things so that lines dont cross is important for understanding115The important point is that all entities and relationships be correctly identifiedSimilarly, there is no required order to the attributes in an entityHowever, common sense dictates being consistent, putting the pk first, listing more important attributes nearer the topIn the authors notation, fks dont appearI still recommend that they be included, marked with fk so that they cant be overlooked116As usual, the rows in a table are not stored in sorted orderWhen picking fields for a table you want to keep in mind any ordering that you might eventually want produced by a queryThere has to be a field for the ORDER BY if you want the data presented in that order117On the next overhead the monarch data model is shown againThere is an implicit ordering to the data based on the valuesIt is interesting to consider whether you could write a query that would show the monarchs in succession orderIt seems that this would have to procedural, like the recursive query to find all products in a given productThe solution to the problem would be a design where the monarchs were simply numbered in order118119Attributes and consistencyIn simple terms, if you use the same field name in different tables, it should have the same meaning in the different tablesWithin a single table, the field should also have exactly the same meaning for every rowThe books example of what shouldnt be done with attributes is outlined on the next overhead120Let there be an attribute stock info that stores either the stocks price to earning ratio or its return on investmentWhich value is held in that field is determined by whether the value 1 or 2 appears in another field named stock info codeThis is very unfortunateOne field in a table now depends on anotherThe meaning of the dependent field varies from record to record121Names and addressesNames and addresses frequently occur in databasesAlthough not incredibly hard, their treatment is usually a little more complex than what you might think at first glanceThere are several basic rules that applyHave you subdivided into sufficiently small fields?Can you handle multiple occurrences?Can you construct something that consists of multiple parts?122Although SQL has string operators that allow you to form queries based on subparts of fields, it is not wise to depend on thisFor example, in the long run it is easier and more logical to have first name, middle name, and last name fields in place of one monolithic name fields123The book also mentions the question of including titles with names (Mr., Mrs., etc.)There is also the question of suffixes (jr., III, and so on)There are those people who have multiple given names (George Herbert Walker Bush)There are also those people who have different names married than when they were singleOr there are people who have changed their names legally or simply use aliasesThe point is that a complete design will handle all of these cases124The question of addresses is not really more difficult, but it is somewhat less familiar than namesThe fundamental problem is that addresses can take many different formsDepending on the organization, an address may be many lines longDepending on the country, an address might come in an uncustomary order125The handling of addresses was mentioned earlierIt has something in common with line items and pay slips in its most complete treatmentThe gives the model on the next overhead as a reminder126127Finally, people might have more than one addressA home address vs. a school addressA mailing address vs. a residential addressThis doesnt add a great deal of complexityits just a one-to-many relationshipThe book gives the ER diagram on the following overhead to illustrate128129Single Instance EntitiesThe moral of this story is that single instance entitiesone row tablesare not a crimeIn the example shown below there would be one firm listedQuite simply, this allows firm information to be storedIt also makes it easy if two hold information about >1 firm if there happens to be a merger130131Picking wordsThe moral of the story here is to base the model on the vocabulary of the usersIt is important to root out inconsistencies in the users vocabulary if there are anyHowever, cramming a different vocabulary down their throat wont work132SynonymsSynonyms in data modeling are just like regular synonymsDifferent users or different groups of users use different words for the same thingSynonyms are not a technical problemYou may get users to agree on one wordYou may also provide different views with different vocabularies133HomonymsHomonyms in data modeling are just like regular homonymsDifferent users or different groups of users use the same words for different thingsHomonyms are not a technical problem, but they are a big practical problem134They cause ambiguity and confusion and have to be tracked downOnce identified, they are easy to fixQualify or expand the names of things so that they are distinguished from each other135Exception huntingWhen working with clients (including yourself) ask these questions:Is it always like this?Would there be any situations where this could be an m:m relationship?Have there ever been any exceptions?Are things likely to change in the future?A good data model should be able to handle exceptional cases136Relationship labelingThe book recommends avoiding relationship labels because they tend to clutter up an ER diagramIt is true that most 1-m relationships should be clearHowever, if any relationship is unclear, it should be labeled137Keeping the data model in shapeAn illustrative example of this idea is very simpleAs you work on a model, you might add an entityIf you do so, do not forget to work out its identifier and attributes before moving on to something elseMaking incomplete additions will quickly turn a model into a mess138Used entitiesDeveloping models is just like writing codeIf you have an earlier model or someone elses model (that you trust) as a starting point, work from thereThere is no need to start from scratch every time139Meaningful identifiersThis is the next major subsectionIt mentions some things to avoid and some things worth trying when picking identifiers, that is, when setting up primary key fieldsThe phrase meaningful identifier means that you can read the key value and find out something useful about the recordThe complete opposite would be randomly generated identifiers140In simple cases, meaningful identifiers might seem like an attractive optionThat would be memorable for usersThe might be simple to administer141However, they have disadvantagesIf the reality youre modeling becomes complex, the identifiers are no longer easy to remember or administerIf they are based on ranges of values, you may exhaust the available rangesIf the underlying reality changes, previously meaningful identifiers lose their meaning142Some large organizations have embedded codes into identifiersVins contain certain identifiable partsUPC codes also contain identifiable partsIf organizations choose to do this, its their businessHowever, no independent organization has to go down this path143The general rule is that the disadvantages of meaningful identifiers outweigh the advantagesEverything that could be coded into an identifier could be, and probably is recorded in an attribute fieldThis means that youve returned to a situation where one field is dependent on anotherThe possible result is inconsistency between the information coded in the identifier and the data recorded in the fieldsWhether random, entry order, or some other scheme, non-meaningful identifiers are preferable144The seven habits of highly effective data modelersThis sounds like a bunch of management bullshit, but if you have to do modeling for clients, these are worthwhile hints:1. ImmerseAs a computer person, when modeling for someone else, you have to learn their problem domain and terminology before you can make a good model.1452. ChallengeThis means challenge the assumptions and find the exceptions3. GeneralizeThis means, when possible, to merge entities together so the model doesnt proliferate1464. TestHave structured walk-throughs at a detailed level, checking entities, attributes, and especially, relationships5. LimitThis means let the project drive the modeling; dont do modeling for the sake of modeling. The model is supposed to lead to practical results. If necessary apply the 80-20 rule in order to control the process.1476. IntegrateThis means that modeling doesnt happen in a vacuum. If an organization has existing systems, fit the new model into the existing one.7. CompleteWhatever limits youve set for yourself, complete the model within those limits. Few things are more worthless than a model that hasnt been finished.148Reference: Basic Structures149The next set of overheads will be given without textual commentaryThis is essentially a repetition or review of the concepts that have been raised in chapters 3 through 7 in this set of overheads and the previous oneThis review will be followed by a short section relating object-oriented and relational designNo relationships150A 1:1 recursive relationship151A recursive 1:m relationship152A recursive m:m relationship153A 1:1 relationship154A 1:m relationship155A m:m relationship156A weak or dependent entity157An associative entity158A tree structure159Another approach to a tree structure160ExercisesThe basic structures chapter ends with exercises where the directions are to write the SQL CREATE statements for the designs shownThe designs came up in the chaptersThey are repeated here just for referenceYou should recognize them and know what kind of structures they represent161162163164165166Ignore the remainder of the overheadsMaterial has been taken from another book and included hereHowever, it will not be coveredIt is simply kept here for future reference167Developing Software with UML: Object-Oriented Analysis and Design in PracticeBernd OestereichChapter 2, Object-Orientation for BeginnersSection 2.13, PersistenceSynopsisPersistence is the storing of objects on a non-volatile mediumThere is no one-to-one mapping to relational databases168169170171172The End173


View more >