
    How to identify records in a DC-XML file that have no file streams?

    • Article Type: General
    • Product: DigiTool
    • Product Version: 3

    Description:
    Is there a script to identify records in a DC-XML input file that have no associated file streams and will also delete them?

    Resolution:
    Attached is a zip file containing two small XSL files and one Java file.

    Unzip the file, upload the three files to any directory on your DigiTool server, and then run them as follows:

    1. Compile the Java code: javac -cp .:$jdtlh_thirdparty/jdom/lib/jdom.jar dcSplit.java

    2. Create two files from the original dc.xml file: one file (1.xml) contains all the records that have a dcterms:hasFormat element, and the other (2.xml) contains all the records without that element:

    java -cp .:$jdtlh_thirdparty/jdom/lib/jdom.jar dcSplit dc.xml 1.xml 2.xml

    The namespace handling has one known issue: you still need to open each output file (for example, 1.xml) in a text editor and use its replace function to make the following global changes:

    From: <record>
    To: <record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">

    Also, you need to change the following line from: <records xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">

    To: <records>

    If you can resolve this namespace issue by updating the XSL files, these manual changes can be avoided.
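    As an alternative to the text editor, the two global replacements described above could be scripted. The following is a minimal sketch using sed, not part of the attached zip file; the function name fix_ns and the output file name are illustrative, and it assumes the namespace declarations appear in the output exactly as quoted above.

```shell
#!/bin/sh
# Sketch only: applies the two global replacements with sed.
# NS holds the three namespace declarations exactly as quoted in this article.
NS='xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/"'

fix_ns() {
    # $1 = file produced by dcSplit (for example 1.xml)
    # $2 = corrected output file (hypothetical name, for example 1_fixed.xml)
    # "|" is used as the sed delimiter because the namespace URLs contain "/".
    sed -e "s|<record>|<record $NS>|g" \
        -e "s|<records $NS>|<records>|g" "$1" > "$2"
}

# Example call (file names are assumptions):
#   fix_ns 1.xml 1_fixed.xml
```

    The sketch writes to a new file rather than editing in place, so the original output of dcSplit stays available for comparison.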

    This approach splits the file before the ingest.
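    Before ingesting, it may be worth checking that no records were lost in the split. The sketch below is an assumption-laden sanity check, not part of the attached files: the function names count_records and check_split are hypothetical, and it simply counts opening record tags with grep rather than parsing the XML.

```shell
#!/bin/sh
# Sketch only: compare record counts in the original file and the two split files.

count_records() {
    # $1 = XML file. Counts opening <record> tags, with or without attributes.
    # The pattern does not match the <records> root element or </record>.
    grep -o '<record[ >]' "$1" | wc -l | tr -d ' '
}

check_split() {
    # $1 = original file, $2 = file with hasFormat, $3 = file without hasFormat
    total=$(count_records "$1")
    with=$(count_records "$2")
    without=$(count_records "$3")
    echo "original=$total with_hasFormat=$with without_hasFormat=$without"
    # Succeeds only if the two output files together hold every original record.
    [ "$total" -eq $((with + without)) ]
}

# Example call (file names as used in step 2):
#   check_split dc.xml 1.xml 2.xml
```

    A non-zero exit status from check_split suggests that the counts do not add up and the split should be reviewed before ingesting.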

    If everything has already been ingested, the following SQL statement counts the duplicate records to be deleted:

    select count(pid) from hdecontrol where ingestid = 'ing1205' and entitytype = 'COMPLEX' and id not in (select control from hderelation);

    Using a modified version of this statement, you can pull a list of PIDs from the database and verify that these are indeed the records to be deleted.
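    One possible modification is to select pid instead of count(pid), keeping the same conditions. The sketch below only generates that statement for a given ingest ID; the function name list_pids_sql is hypothetical, and how the statement is then run depends on the SQL client used at your site, which this article does not specify.

```shell
#!/bin/sh
# Sketch only: print a SELECT that lists the PIDs matched by the count query.

list_pids_sql() {
    # $1 = ingest ID (for example ing1205)
    # Same WHERE clause as the count statement; only the selected column differs.
    cat <<EOF
select pid from hdecontrol
 where ingestid = '$1'
   and entitytype = 'COMPLEX'
   and id not in (select control from hderelation);
EOF
}

# Example call: save the statement to a file for review before running it.
#   list_pids_sql ing1205 > list_pids.sql
```

    Reviewing the resulting list before running the update in the next step helps confirm that only the intended records will be marked for deletion.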

    You may do the following to delete those records:

    1. Run a SQL query:

    Update hdecontrol set partitiona = 'tobedeleted' where ingestid = 'ing1205' and entitytype = 'COMPLEX' and id not in (select control from hderelation);

    2. Run the “Re-indexing job” to first index the new partition A value:

    a. Log on to the web management module (mng) and then select “Maintenance” on the top menu
    b. Select “Re-indexing job”, click Next
    c. Search by ingest id: ing1205, and re-index all the records of ing1205

    3. Run the “Delete Digital Entities” service to delete all records with partition A equal to “tobedeleted”:

    a. Log on to the web management module (mng) and then select “Maintenance” on the top menu
    b. Select “Delete Digital Entities”, click Next
    c. Find records by “Partition A” and delete

    Additional Information

    DC, XML, script, file_stream, file stream, delete, remove


    • Article last edited: 10/8/2013