Working group members: Kevin Kishimoto, Lisa McFall, James Soe Nyun (group leader).
The BIBFRAME Task Force working group 4.3 was tasked with examining the MARC-to-BIBFRAME transformation of uniform titles. Current conversion tools from the Library of Congress and Zepheira focus on converting metadata in MARC bibliographic records, and the group likewise focused on transforming bibliographic records.
Our tests revealed many things that worked well in one or the other converter, along with many operations that either produced erroneous results or revealed features that are not yet fully developed. We used the results to formulate what we think an ideal conversion tool should do.
The conversion tool needs to output a string that duplicates the original access point string in content, punctuation and order, minus the MARC subfield codes. Map the complete output string as the object value of bf:authorizedAccessPoint.
When the source creator/title string is split between the 1xx and 240 field, the complete string for creator from the 1xx field and the complete string for title from the 240 must be reassembled into an access point string. Supply appropriate punctuation between creator and title elements. This string should then be the value for the bf:authorizedAccessPoint property.
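The reassembly described in the two recommendations above can be sketched as follows. This is only an illustration, not the behavior of either converter: the (code, value) representation of subfields, the names field_to_string and build_access_point, the set of control subfields that are skipped, and the choice of a period as the separator between creator and title are all assumptions.

```python
from typing import List, Tuple

# A field is modeled as (subfield code, value) pairs in record order (assumption).
Subfield = Tuple[str, str]

# Control subfields assumed not to belong to the access point text.
# Relator terms such as $e would need tag-specific handling, not shown here.
NON_DISPLAY_CODES = {"0", "3", "4", "5", "6", "8"}


def field_to_string(subfields: List[Subfield]) -> str:
    """Join subfield values in original order, dropping only the codes."""
    parts = [
        value.strip()
        for code, value in subfields
        if code not in NON_DISPLAY_CODES and value.strip()
    ]
    return " ".join(parts)


def build_access_point(creator: List[Subfield], title: List[Subfield]) -> str:
    """Combine a 1xx creator and a 240 title into one access point string."""
    creator_text = field_to_string(creator)
    title_text = field_to_string(title)
    if not creator_text:
        return title_text
    # Supply a period unless the creator already ends in closing punctuation
    # or an open date hyphen (assumed convention).
    if not creator_text.endswith((".", "-", "?", "!")):
        creator_text += "."
    return f"{creator_text} {title_text}"


creator_100 = [("a", "Beethoven, Ludwig van,"), ("d", "1770-1827."), ("0", "(EXAMPLE)0001")]
title_240 = [("a", "Sonatas,"), ("m", "piano,"), ("n", "no. 14, op. 27, no. 2,"), ("r", "C# minor")]
print(build_access_point(creator_100, title_240))
# Beethoven, Ludwig van, 1770-1827. Sonatas, piano, no. 14, op. 27, no. 2, C# minor
```

The resulting string would be the value of bf:authorizedAccessPoint. Because the values are joined without being altered, the original punctuation and order survive.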
When the access point comes from any place other than the bibliographic main entry, create a separate bf:Work statement for it. Supply the complete authority string in bf:authorizedAccessPoint. Within the main resource bf:Work statement, supply the appropriate BF property (e.g., bf:hasPart, bf:relatedResource, bf:subject) to associate the main work to the separate work.
Harvest the creator portion of creator/title strings and generate a statement for it, identified as a bf:Person, bf:Organization or bf:Event. All values for subfields associated with the creator should be harvested, retaining order and punctuation, to construct a creator string, which is mapped to the bf:authorizedAccessPoint within the statement for person, organization or event.
Reference the preceding creator within the bf:Work statement where the complete access point string resides. Use the bf:creator property.
When the original content supplies an identifier for the access point, the identifier should remain associated with it. If possible, use an algorithm to map to bf:identifier if the identifier appears to be machine-actionable, to bf:identifierValue if not.
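One possible reading of "machine-actionable" is "an HTTP(S) URI". The sketch below applies that test; the function name identifier_property and the URI-only heuristic are assumptions, and a real converter might also recognize other resolvable schemes.

```python
from urllib.parse import urlparse


def identifier_property(value: str) -> str:
    """Pick the BF property for an identifier found with an access point."""
    parsed = urlparse(value.strip())
    # Assumption: only http/https URIs with a host count as machine-actionable.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "bf:identifier"
    # Anything else (e.g. a prefixed control number) is kept as a plain value.
    return "bf:identifierValue"


print(identifier_property("http://example.org/authorities/0001"))  # bf:identifier
print(identifier_property("(EXAMPLE)0001"))                         # bf:identifierValue
```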
When the input contains 880 fields with parallel graphic representations of a single work, the converter needs to associate the different forms of the access point string.
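In MARC, a field and its 880 counterpart point at each other through $6: the regular field carries a value such as "880-01", and the 880 carries the linked tag and the same occurrence number, such as "100-01/$1". A minimal sketch of pairing the two forms on that basis follows. The (tag, $6 value, text) tuples and the function name pair_parallel_fields are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# $6 begins with the linked tag and a two-digit occurrence number, e.g. "880-01".
LINKAGE = re.compile(r"^(\d{3})-(\d{2})")

Field = Tuple[str, Optional[str], str]  # (tag, $6 value or None, access point text)


def pair_parallel_fields(fields: List[Field]) -> List[Tuple[str, str]]:
    """Return (original form, parallel form) pairs linked through $6."""
    originals = {}
    parallels = {}
    for tag, linkage, text in fields:
        match = LINKAGE.match(linkage or "")
        if not match:
            continue
        linked_tag, occurrence = match.groups()
        if occurrence == "00":
            continue  # occurrence 00 marks an 880 with no associated field
        if tag == "880":
            parallels[(linked_tag, occurrence)] = text
        elif linked_tag == "880":
            originals[(tag, occurrence)] = text
    return [(originals[key], parallels[key]) for key in originals if key in parallels]


record = [
    ("100", "880-01", "Murakami, Haruki, 1949-"),
    ("880", "100-01/$1", "村上春樹, 1949-"),
    ("245", None, "A title with no parallel form"),
]
print(pair_parallel_fields(record))
```

Each pair could then be attached to the same bf:Work or agent statement as alternate forms of one access point.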
Use the presence of second indicator 2 in MARC 700, 710, 711 and 730 to map a work into bf:hasPart.
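The indicator test can be sketched in a few lines. The function name work_link_property is hypothetical, and the fallback to bf:relatedResource for other indicator values is an assumption; the recommendation itself only specifies the bf:hasPart case.

```python
# Tags for which second indicator 2 signals a contained (analytic) work.
ANALYTIC_TAGS = {"700", "710", "711", "730"}


def work_link_property(tag: str, indicator2: str) -> str:
    """Choose the property linking the main work to an added-entry work."""
    if tag in ANALYTIC_TAGS and indicator2 == "2":
        return "bf:hasPart"
    # Assumed fallback for added entries that are not marked as analytics.
    return "bf:relatedResource"


print(work_link_property("700", "2"))  # bf:hasPart
print(work_link_property("730", " "))  # bf:relatedResource
```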
Harvest external identifiers when available for each authorized access point. If an access point already has an identifier, add the new harvested identifier unless it duplicates the existing one. Supply the property bf:identifier.
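A sketch of adding harvested identifiers without duplicating existing ones is given below. The function name merge_identifiers and the normalization rule (trim whitespace and a trailing slash before comparing) are assumptions; the example.org URIs are placeholders.

```python
from typing import List


def merge_identifiers(existing: List[str], harvested: List[str]) -> List[str]:
    """Append harvested identifiers that are not already present."""

    def normalize(value: str) -> str:
        # Assumed comparison rule: ignore surrounding space and a trailing slash.
        return value.strip().rstrip("/")

    seen = {normalize(value) for value in existing}
    merged = list(existing)
    for value in harvested:
        key = normalize(value)
        if key not in seen:
            seen.add(key)
            merged.append(value)
    return merged


current = ["http://example.org/authorities/0001"]
found = ["http://example.org/authorities/0001/", "http://example.org/other/42"]
print(merge_identifiers(current, found))
# ['http://example.org/authorities/0001', 'http://example.org/other/42']
```

Each value in the merged list would be supplied with the bf:identifier property.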
Extract bf:label by duplicating bf:authorizedAccessPoint string.
When the record has a 240, extract bf:Title, using the original MARC elements to decide which elements should map into the title. (Current tools are lossy.)
Map all MARC subfields that correspond directly to properties in BF: $r to bf:musicKey, $o to bf:musicVersion, $m to bf:musicMediumNote, $p to bf:partTitle, and $l to bf:languageNote (and, if possible through algorithmic matching, to a URI in bf:language). Values from the 240 should map into the bf:Work entity of the main resource; others should map into the bf:Work locations for the related or contained works.
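The direct mappings listed above lend themselves to a lookup table. The sketch below is one way to apply it; the function name map_subfields, the (code, value) input format, and the decision to trim only trailing commas and semicolons (so that abbreviations such as "arr." keep their period) are assumptions.

```python
from typing import Dict, List, Tuple

# Subfield-to-property pairs taken from the recommendation above.
SUBFIELD_TO_PROPERTY = {
    "r": "bf:musicKey",
    "o": "bf:musicVersion",
    "m": "bf:musicMediumNote",
    "p": "bf:partTitle",
    "l": "bf:languageNote",
}


def map_subfields(subfields: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect property values for the subfields that map directly to BF."""
    result: Dict[str, List[str]] = {}
    for code, value in subfields:
        prop = SUBFIELD_TO_PROPERTY.get(code)
        if prop is None:
            continue  # subfields such as $a or $n need separate handling
        cleaned = value.strip().rstrip(" ,;")
        result.setdefault(prop, []).append(cleaned)
    return result


title_240 = [("a", "Sonatas,"), ("m", "piano,"), ("r", "C minor;"), ("o", "arr.")]
print(map_subfields(title_240))
# {'bf:musicMediumNote': ['piano'], 'bf:musicKey': ['C minor'], 'bf:musicVersion': ['arr.']}
```

The caller decides which bf:Work receives the result: the main resource for a 240, or the related or contained work for other fields.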
MARC tag 383 allows for a fine breakdown of different types of music numbers associated with a resource. These types have no specific mapping within BF other than to the less granular bf:musicNumber. We would like a way to be more specific, probably by linking to an external, more specific vocabulary within the bf:musicNumber property. Many bibliographic records will lack the 383 field, and it would be good to harvest values from the $n of an access point string (or import them from external Work authority records). This subfield unfortunately serves many purposes in access points for musical works, acting to define a music number, a date of creation, or the number of a part of a work. Still, if a converter can be sensitive to the punctuation of the original string, we feel that the $n could map reliably to bf:musicNumber, bf:partNumber, or bf:originDate, based on its relation to adjacent punctuation:
$n following a comma : map to bf:musicNumber
$n includes an opening parenthesis, followed by a date, followed by a closing parenthesis : map date to bf:originDate
$n following a period : map to bf:partNumber
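The three punctuation rules above can be sketched as a small classifier. The function name classify_n, the four-digit date pattern, and the choice to test for a parenthesized date before looking at the preceding punctuation are assumptions. A preceding subfield that ends in an abbreviation period (for example "arr.") would be misread as a period by this simple check.

```python
import re
from typing import Optional, Tuple

# A year or year range in parentheses, e.g. "(1987)" or "(1987-1990)" (assumed pattern).
DATE_IN_PARENS = re.compile(r"\((\d{4}(?:-\d{4})?)\)")


def classify_n(previous_value: str, n_value: str) -> Tuple[Optional[str], str]:
    """Return (BF property, value) for a $n, judged by adjacent punctuation."""
    date = DATE_IN_PARENS.search(n_value)
    if date:
        return ("bf:originDate", date.group(1))
    preceding = previous_value.rstrip()
    cleaned = n_value.strip().rstrip(" ,;")
    if preceding.endswith(","):
        return ("bf:musicNumber", cleaned)
    if preceding.endswith("."):
        return ("bf:partNumber", cleaned)
    return (None, cleaned)  # no rule applies; leave for review


print(classify_n("Sonatas,", "no. 14, op. 27, no. 2,"))  # ('bf:musicNumber', 'no. 14, op. 27, no. 2')
print(classify_n("Pieces", "(1987)"))                    # ('bf:originDate', '1987')
print(classify_n("Title.", "Part 2"))                    # ('bf:partNumber', 'Part 2')
```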
In many cases it would be possible to extract more granular information from the string in bf:musicNumber based on content and how it is formulated, and it is worth further discussing how more specific properties like opus number or serial number might be detected and extracted. As with our recommendation for the 383, it would probably make most sense to link out to an external vocabulary with these more granular terms to define the type of the property.
Associate 1xx with 245 $a if no 240 is present in MARC record. Create bf:authorizedAccessPoint.
Retain or transform relationship designators when present in the source to supply work-to-work relationships. bf:hasPart can be extracted from the MARC second indicator 2, but there could be more granular information in the $i that should be preserved, probably as a link to external vocabulary within bf:hasPart. Other relationships could be recorded within the bf:relatedResource property, where the only current subproperties are precedes and succeeds, or in the bf:derivativeOf or bf:derivedFrom properties, or in the several other predefined BF properties. This could probably be accomplished by mapping certain designators to corresponding BF properties. For other, undefined relationships, prefer referring out to external vocabulary such as RDA relators rather than devising new BF terms, perhaps also harvesting external URIs for these concepts.
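A sketch of mapping designators to properties while keeping the original designator for later linking to an external vocabulary follows. The two table entries are purely illustrative pairings, not a vetted crosswalk between RDA designators and BF properties; the function name map_relationship and the returned dictionary shape are also assumptions.

```python
from typing import Dict, Optional

# Illustrative pairings only; a real crosswalk would need to be agreed on.
DESIGNATOR_TO_PROPERTY = {
    "container of (work)": "bf:hasPart",
    "based on (work)": "bf:derivativeOf",
}


def map_relationship(designator: Optional[str], indicator2: str = " ") -> Dict[str, Optional[str]]:
    """Pick a BF property for a $i designator and retain the designator itself."""
    normalized = (designator or "").strip().rstrip(":").strip().lower()
    prop = DESIGNATOR_TO_PROPERTY.get(normalized)
    if prop is None:
        # Fall back on the second indicator, then on the generic property.
        prop = "bf:hasPart" if indicator2 == "2" else "bf:relatedResource"
    # The designator is kept so it can be matched to an external vocabulary.
    return {"property": prop, "designator": normalized or None}


print(map_relationship("Based on (work):"))
print(map_relationship(None, "2"))
print(map_relationship("Parody of (work):"))
```

Keeping the designator alongside the chosen property means no granularity is lost when the designator has no dedicated BF property.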
Future needs, possibilities
Transformation tools for authority records…
A way to round-trip metadata back into MARC, or convert natively-created BIBFRAME metadata into a usable MARC bib record… (Interestingly, the Zepheira tool brings over some information in the original MARC field- and subfield-structure. This is not so useful from a BF standpoint, but it would assure a safer return trip back to MARC.)
Algorithms could be used to mine deeper into data elements within the AAP string and form more definite relationships. A tool could possibly extract the identifiers for instruments named within bf:musicMediumNote, and then record that information in bf:musicMedium. Since some AAP strings are based on musical forms, an algorithm could supply a fairly reliable form or genre term; there would be false drops, so discussions would need to take place as to whether this would be a desired direction to go.
Explore ways to make use of values in work AAP subfields $3 and $5.
Appendix 1: Bugs with the LC and Zepheira Converters
The $n in 700, 710, 711 fields is appended to agent’s name access point, in addition to being in the title of the work.
No content from the 440 field is converted.
No content from the 630 field is converted.
The $g element extracted from the 710 AAP maps incorrectly into the bf:treatySignator statement for the main 245 Work.
Elements extracted from the 730 AAP map incorrectly into the statement for the main 245 Work; this includes subfields $m (into bf:mediumNote), $n (into bf:partNumber), and $d (into bf:legalDate).
Not all identifiers ($0, $x) in work access point strings are converted. Nothing is done with $0 identifiers in 700, 710, 711, 730, 800, 810, 811 or 830; those in 240, 400, 600, 610 and 611 are converted. The $x ISSNs in 400, 410, 411, 800, 810, 811 were mapped; those in 430 and 830 were not.
Conversion of the 130 had several problems. First, it leaves off $o; other subfields were not tested, but may have similar issues. It also creates a bf:label and bf:authorizedAccessPoint that consist of the first 700 author prepended to the $a of the 130; $o from the 130 is definitely left off of these, and other subfields may be affected.
The $r from the 240 is mapped appropriately into the bf:musicKey of the main work; however it is left out of the bf:authorizedAccessPoint, bf:label and associated bf:Title statements.
Although the converter appears to correctly harvest associated identifiers for names and topics, it isn’t finding identifiers for works when identifiers are known to exist.
Language $l converts into bf:authorizedAccessPoint only from fields 600, 610 and 611. It is stripped from the bf:authorizedAccessPoint of the 240, 400, 410, 411, 440, 700, 710, 711, 730, 800, 810, 811 and 830.
When the converter encounters a parallel graphic representation of an access point, it pulls a language code from the MARC 008/35-37 into xml:lang. This is an unreliable way to determine the language of the alternate representation, since catalogers supply “zxx” (no linguistic content) for instrumental musical works. Also, the language bytes are supplied by the cataloger to apply to the main resource, and the language may not be appropriate for a component. The language of cataloging in the 040 also would not be a dependable source. We don’t see a reliable way to populate the attribute for the language of the title.
In 600, 610 and 611 fields, subfields $f, $3 and $4 are incorrectly carried over into the bf:authorizedAccessPoint and the madsrdf:authoritativeLabel properties.
When a single MARC bib record contains multiple 7xx fields representing multiple Works by the same Creator, duplicate bf:Agent entities are created, even though the bf:authorizedAccessPoint for these Agents match exactly.
We didn’t test the Zepheira tool as exhaustively as LC’s, but below are some of the things we noticed.
Extracts bf:musicKey, bf:musicMedium (incorrectly, see below), bf:language and bf:arrangedMusic. However, largely fails to map MARC into BF, with much of the original MARC subfielding carrying over into the BF output. The largest failing is that no equivalent to an authorized access point is generated that could be used for either display or matching to external identifiers.
The $m maps to bf:musicMedium, which expects an identifier, rather than bf:musicMediumNote, which can accommodate the strings found in $m.
The converter makes no attempt to match the names and works against an external name authority and extract identifiers.
When names are extracted from a name/title AAP, most subfields appropriate to the title portion are also placed in the name string.
When multiple $n occur in a single heading, the converter is inconsistent in the order in which they are output as part of the rdfs:label, sometimes switching the order; this behavior seems to be unpredictable. The multiple $n do seem to be predictably retained as separate data elements in ns2:titleNumber, but not always in the order of the original heading.
Appendix 2: Working Document with Conversion Results and Comments
Forthcoming: A link to our working spreadsheet with the digested results.