This repository was archived by the owner on Feb 4, 2020. It is now read-only.
Trouble reading Harvard Open Metadata MARC files (UTF-8 related?) #89
Open
Description
I am trying to use pymarc to read the Harvard Open Metadata MARC files.
Most of the files process OK, but some (for example ab.bib.14.20160401.full.mrc) produce errors during processing. The error I am getting is:
Traceback (most recent call last):
  File "domark.py", line 21, in <module>
    for record in reader:
  File "/Library/Python/2.7/site-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/Users/markwatkins/Sites/pharvard/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
The driver code I am using is:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader
UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)
if len(sys.argv) >= 2:
    files = [sys.argv[1]]
    for file in files:
        with open(file, 'rb') as fh:
            reader = MARCReader(fh, utf8_handling='ignore')
            for record in reader:
                # print "%s by %s" % (record.title(), record.author())
                print(record.as_json())
Other MARC processing tools (e.g. MarcEdit) seem to process the file with no issues, so I think the file is legitimate.
Am I doing something wrong? Is there an issue with pymarc, possibly UTF-8 processing related?
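For reference, the failing line from the last frame of the traceback can be reproduced in isolation. This is only a sketch, not pymarc code: the helper name `decode_subfield_code` and the sample byte strings are made up for illustration, and it assumes the offending subfield simply begins with the byte 0xc4 (the lead byte of a two-byte UTF-8 sequence) where a one-byte ASCII subfield code is expected.

```python
def decode_subfield_code(subfield):
    """Hypothetical helper mimicking the failing line in the traceback:
    decode only the first byte of a subfield as ASCII."""
    return subfield[0:1].decode('ascii')


# An ordinary subfield: the one-byte code 'a' followed by the data.
ok = decode_subfield_code(b'aTitle')

# A made-up subfield that starts with 0xc4 0x81 (UTF-8 for U+0101)
# instead of an ASCII subfield code.
try:
    decode_subfield_code(b'\xc4\x81bc')
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(ok)
print(error)
```

If that assumption holds, the exception is raised while decoding the subfield code rather than the subfield data, which might explain why `utf8_handling='ignore'` does not seem to help here.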