Scanning, indexing and OCR processing of newspaper holdings from libraries in and outside Hamburg, comprising approx. 1 251 298 scans (including inserts) according to the specifications, plus OCR-only processing of 552 924 existing scans of newspaper pages.
- Number of original volumes to be processed = approx. 306,
- Number of rolls of microfilm to be processed = approx. 143,
- Number of pages to be scanned, indexed and OCR-processed = approx. 509 280,
- Number of pages for OCR only = 426 050,
- Number of OCR units (DIN A4) = 1 826 300.
Conditions:
Digitization of volumes:
- Scanning in 24-bit full color at 300 dpi (based on the quality standard Metamorfoze Extra Light),
- Capture of single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— color space: eciRGBv2,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— The scans are named numerically in ascending order within the directory per day. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log.
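The naming convention above can be sketched as a small path builder. This is only an illustration under assumptions that the notice does not fix: the acronym `HN` is hypothetical, the newspaper and year directories are assumed to be named after the acronym and the four-digit year, numbering is assumed to start at 1, and no marker for several issues per day is modelled, since its exact form is only specified when the order is placed.

```python
from datetime import date
from pathlib import PurePosixPath


def scan_path(acronym: str, issue_date: date, scan_number: int,
              extension: str = "tif") -> PurePosixPath:
    """Build <acronym>/<YYYY>/<acronym>_<YYYYMMDD>/<8-digit number>.<ext>.

    The two upper directory levels are an assumption; the notice only
    fixes the day directory name and the eight-digit file name.
    """
    if not 1 <= scan_number <= 99_999_999:
        raise ValueError("scan number must fit into eight digits")
    day_dir = f"{acronym}_{issue_date:%Y%m%d}"
    file_name = f"{scan_number:08d}.{extension}"
    return PurePosixPath(acronym, f"{issue_date:%Y}", day_dir, file_name)


# Hypothetical newspaper "HN", issue of 7 March 1925, first scan.
print(scan_path("HN", date(1925, 3, 7), 1))
```

Passing `extension="xml"` yields the matching name for an OCR file in the same directory.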
For digitizing microfilms:
- Preliminary analysis of each roll to determine the best scan parameters,
- Scanning of the double-page frames in 8-bit grayscale at 300 dpi relative to the original size,
- Separation into single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— The scans are named numerically in ascending order within the directory per day. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log,
- Any available filming reports noting known missing pages and other special features are provided by the client.
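Some of the TIFF delivery requirements above (uncompressed, single frame, no alpha channel) can be checked mechanically. The following is a minimal sketch using only the Python standard library, not a full TIFF validator: it reads the first image file directory of a classic TIFF, does not handle BigTIFF, and treats any ExtraSamples tag as a possible alpha channel. Whether an image was derived from an internal JPEG cannot be determined this way. The function name is hypothetical.

```python
import struct


def check_tiff(data: bytes) -> list[str]:
    """Return a list of problems found in the first IFD of a classic TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if len(data) < 8 or order is None:
        return ["not a classic TIFF file"]
    if struct.unpack(order + "H", data[2:4])[0] != 42:
        return ["not a classic TIFF file"]
    (ifd_offset,) = struct.unpack(order + "I", data[4:8])
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = {}
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack(order + "HHI", data[start:start + 8])
        # A single SHORT value (type 3) sits in the first two value bytes.
        (short_value,) = struct.unpack(order + "H", data[start + 8:start + 10])
        tags[tag] = short_value if field_type == 3 and count == 1 else None
    end = ifd_offset + 2 + 12 * entry_count
    (next_ifd,) = struct.unpack(order + "I", data[end:end + 4])

    problems = []
    if tags.get(259, 1) != 1:   # tag 259 = Compression, 1 = uncompressed
        problems.append("compressed image data")
    if 338 in tags:             # tag 338 = ExtraSamples
        problems.append("extra samples (possible alpha channel)")
    if next_ifd != 0:           # a further IFD means a further frame
        problems.append("more than one frame (multipage)")
    return problems
```

An empty list means that none of the three checked conditions was violated; it does not prove that the file meets the remaining requirements.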
For OCR processing:
The OCR result is expected as ALTO XML conforming to ALTO schema version 2.0, in an XML 1.0 document with UTF-8 encoding. The 'MeasurementUnit' parameter is set to “pixel”.
Each OCR file takes the same name as its image file (e.g. 00000001.xml) and is stored alongside the image files in the directory structure described above.
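As an illustration of the expected output, the sketch below writes a bare ALTO skeleton in the v2 namespace with the measurement unit set to pixel, serialized as UTF-8 XML 1.0. It uses only Python's standard library. The page identifier, the page dimensions and the function name are placeholders, the skeleton contains no recognized text, and it has not been validated against the ALTO 2.0 schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def empty_alto_page(width_px: int, height_px: int) -> bytes:
    """Serialize a minimal ALTO v2 skeleton with pixel units as UTF-8 XML."""
    ET.register_namespace("", ALTO_NS)  # emit ALTO as the default namespace
    alto = ET.Element(f"{{{ALTO_NS}}}alto")
    description = ET.SubElement(alto, f"{{{ALTO_NS}}}Description")
    unit = ET.SubElement(description, f"{{{ALTO_NS}}}MeasurementUnit")
    unit.text = "pixel"
    layout = ET.SubElement(alto, f"{{{ALTO_NS}}}Layout")
    # Placeholder page; a real file would carry the OCR text blocks here.
    ET.SubElement(layout, f"{{{ALTO_NS}}}Page", {
        "ID": "P1",
        "PHYSICAL_IMG_NR": "1",
        "WIDTH": str(width_px),
        "HEIGHT": str(height_px),
    })
    return ET.tostring(alto, encoding="utf-8", xml_declaration=True)


print(empty_alto_page(2480, 3508).decode("utf-8"))
```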
Collection and return delivery:
- Commerzbibliothek of the Hamburg Chamber of Commerce,
- Hamburg State and University Library,
- Hamburg State Archives, library,
- Bremen State and University Library,
- Schleswig-Holstein State Library Kiel.
- Number of original volumes to be processed = approx. 360 (of which 337 from the Hamburg State Archives; some may be replaced by microfilms),
- Number of rolls of microfilm to be processed = approx. 16,
- Number of pages to be scanned, indexed and OCR-processed = approx. 404 228,
- Number of pages for OCR only = 126 874,
- Number of OCR units (DIN A4) = 2 124 408.
Conditions:
Digitization of volumes:
- Scanning in 24-bit full color at 300 dpi (based on the quality standard Metamorfoze Extra Light),
- Capture of single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— color space: eciRGBv2,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— The scans are named numerically in ascending order within the directory per day. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log.
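The 5-10 mm border and the 300 dpi resolution stated above translate directly into pixel counts. The following sketch shows the arithmetic; the function names are hypothetical, and rounding to the nearest pixel is an assumption, since the notice does not say how border widths are measured.

```python
MM_PER_INCH = 25.4


def mm_to_px(millimetres: float, dpi: int = 300) -> int:
    """Convert a length in millimetres to pixels at the given resolution."""
    return round(millimetres / MM_PER_INCH * dpi)


def border_within_spec(border_px: int, dpi: int = 300) -> bool:
    """True if a measured border width lies within 5-10 mm at this dpi."""
    return mm_to_px(5, dpi) <= border_px <= mm_to_px(10, dpi)


print(mm_to_px(5), mm_to_px(10))    # border range in pixels: 59 118
print(mm_to_px(210), mm_to_px(297)) # a DIN A4 page in pixels: 2480 3508
```

At 24 bits (3 bytes) per pixel, an uncompressed DIN A4 page of 2480 x 3508 pixels therefore holds 26 099 520 bytes of image data, before the border and any file overhead.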
For digitizing microfilms:
- Preliminary analysis of each roll to determine the best scan parameters,
- Scanning of the double-page frames in 8-bit grayscale at 300 dpi relative to the original size,
- Separation into single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— Within the directory per day, the scans are named numerically in ascending order. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log,
- Any available filming reports noting known missing pages and other special features are provided by the client.
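Because the 300 dpi requirement for microfilm refers to the original page size, the resolution needed on the film itself depends on the reduction ratio at which the newspaper was filmed. The sketch below shows that relationship. The reduction ratio of 15 used in the example is purely hypothetical; the actual ratios would have to come from the preliminary analysis of each roll or from the filming reports.

```python
def required_film_dpi(reduction_ratio: float, target_dpi: int = 300) -> float:
    """Resolution needed on the film to reach target_dpi at original size.

    reduction_ratio is the factor by which the original was reduced on
    film, e.g. 15 for a 15:1 reduction (a hypothetical value).
    """
    if reduction_ratio <= 0:
        raise ValueError("reduction ratio must be positive")
    return target_dpi * reduction_ratio


print(required_film_dpi(15))  # 4500 dpi on the film for 300 dpi at original size
```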
For OCR processing:
The OCR result is expected as ALTO XML conforming to ALTO schema version 2.0, in an XML 1.0 document with UTF-8 encoding. The 'MeasurementUnit' parameter is set to “pixel”.
Each OCR file takes the same name as its image file (e.g. 00000001.xml) and is stored alongside the image files in the directory structure described above.
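Since every image file is expected to have an OCR file of the same name next to it, completeness of a day directory can be checked by comparing file stems. This is a minimal sketch operating on a list of file names; the function name is hypothetical, and it assumes lowercase `.tif` and `.xml` extensions as in the examples of the notice.

```python
from pathlib import PurePosixPath


def unpaired(file_names: list[str]) -> tuple[list[str], list[str]]:
    """Return (images without OCR file, OCR files without image) as stems."""
    images = {PurePosixPath(n).stem for n in file_names if n.endswith(".tif")}
    ocr = {PurePosixPath(n).stem for n in file_names if n.endswith(".xml")}
    return sorted(images - ocr), sorted(ocr - images)


names = ["00000001.tif", "00000001.xml", "00000002.tif"]
print(unpaired(names))  # (['00000002'], [])
```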
Collection and return delivery:
- Hamburg State Archives, library,
- Hamburg State and University Library,
- ZBW - Leibniz Information Center for Economics, Kiel location.
- Number of original volumes to be processed = approx. 296,
- Number of rolls of microfilm to be processed = approx. 44,
- Number of pages to be scanned, indexed and OCR-processed = approx. 337 790,
- Number of OCR units (DIN A4) = 1 323 008.
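The page counts in the three sets of figures in this notice can be cross-checked against the totals stated in the opening paragraph. The short calculation below does only that; the variable names are illustrative. The third set of figures states no OCR-only pages, so only two values enter the second sum.

```python
# Pages to be scanned, indexed and OCR-processed, one value per set of figures.
SCAN_INDEX_OCR_PAGES = (509_280, 404_228, 337_790)
# Pages for OCR only; the third set of figures lists none.
OCR_ONLY_PAGES = (426_050, 126_874)

total_scans = sum(SCAN_INDEX_OCR_PAGES)
total_ocr_only = sum(OCR_ONLY_PAGES)

print(total_scans)     # 1251298, the approximate scan total in the opening paragraph
print(total_ocr_only)  # 552924, the number of existing scans for OCR only
```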
Conditions:
Digitization of volumes:
- Scanning in 24-bit full color at 300 dpi (based on the quality standard Metamorfoze Extra Light),
- Capture of single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— color space: eciRGBv2,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— Within the directory per day, the scans are named numerically in ascending order. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log.
For digitizing microfilms:
- Preliminary analysis of each roll to determine the best scan parameters,
- Scanning of the double-page frames in 8-bit grayscale at 300 dpi relative to the original size,
- Separation into single pages and straightening of tilted pages,
- Cropping with a sufficiently wide black border (5-10 mm),
- Cropping of small-format pages (e.g. inserts shown against the background of larger-format pages),
- Rotation of the scan into the reading direction where necessary,
- Delivery of the images as uncompressed TIFF files:
—— without multiple frames / not multipage,
—— without alpha channel,
—— not created from an internal JPEG.
- The scans of a newspaper are grouped in a directory structure by year. Each newspaper is assigned an acronym (see table above).
—— Below the year level, a separate directory is created for each day, named according to the “Acronym_YYYYMMDD” convention. If there is more than one issue per day, this must be marked; the exact specification is given when the order is placed,
—— The scans are named numerically in ascending order within the directory per day. The numeric part has eight digits (e.g. 00000001.tif).
- Supplements or special editions with their own title are filed in the order in which they occur; the scan numbering of the daily edition continues in its directory without further identification,
- Missing pages are documented in the contractor's work log,
- Any available filming reports noting known missing pages and other special features are provided by the client.
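Delivered directory and file names can be checked against the conventions above before hand-over. The sketch below is one possible check under stated assumptions: acronyms are assumed to contain no underscore, numbering is assumed to start at 1 and to be gapless within a day directory (the notice only requires ascending order), and no marker for several issues per day is handled. All names are hypothetical.

```python
import re
from datetime import datetime

DAY_DIR = re.compile(r"(?P<acronym>[^_/]+)_(?P<day>\d{8})")
SCAN_FILE = re.compile(r"\d{8}\.tif")


def valid_day_dir(name: str) -> bool:
    """True if name looks like Acronym_YYYYMMDD with a real calendar date."""
    match = DAY_DIR.fullmatch(name)
    if match is None:
        return False
    try:
        datetime.strptime(match["day"], "%Y%m%d")
    except ValueError:
        return False
    return True


def numbering_gaps(file_names: list[str]) -> list[int]:
    """Return scan numbers missing between 1 and the highest number found."""
    numbers = {int(n[:8]) for n in file_names if SCAN_FILE.fullmatch(n)}
    if not numbers:
        return []
    return sorted(set(range(1, max(numbers) + 1)) - numbers)


print(valid_day_dir("HN_19250307"))  # True
print(numbering_gaps(["00000001.tif", "00000002.tif", "00000004.tif"]))  # [3]
```

A gap in the numbering is only a hint for closer inspection; missing pages of the original are recorded in the work log, not in the file numbering.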
For OCR processing:
The OCR result is expected as ALTO XML conforming to ALTO schema version 2.0, in an XML 1.0 document with UTF-8 encoding. The 'MeasurementUnit' parameter is set to “pixel”.
Each OCR file takes the same name as its image file (e.g. 00000001.xml) and is stored alongside the image files in the directory structure described above.
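Two of the OCR requirements, the ALTO v2 namespace and the pixel measurement unit, can be spot-checked on delivered files with Python's standard library. This is a minimal sketch with a hypothetical function name. It does not validate the file against the ALTO 2.0 schema and does not inspect the declared encoding.

```python
import xml.etree.ElementTree as ET

ALTO_V2 = "http://www.loc.gov/standards/alto/ns-v2#"


def alto_problems(xml_bytes: bytes) -> list[str]:
    """Return deviations from the expected namespace and measurement unit."""
    root = ET.fromstring(xml_bytes)
    problems = []
    if root.tag != f"{{{ALTO_V2}}}alto":
        problems.append("root element is not alto in the v2 namespace")
    unit = root.findtext("a:Description/a:MeasurementUnit",
                         namespaces={"a": ALTO_V2})
    if unit != "pixel":
        problems.append(f"MeasurementUnit is {unit!r}, expected 'pixel'")
    return problems


sample = (b'<?xml version="1.0" encoding="UTF-8"?>'
          b'<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">'
          b'<Description><MeasurementUnit>pixel</MeasurementUnit></Description>'
          b'</alto>')
print(alto_problems(sample))  # []
```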
Collection and return delivery:
- Hamburg State Archives, library,
- Research Center for Contemporary History Hamburg,
- Hamburg State and University Library,
- Library of the Federal Archives Berlin,
- Library of the Friedrich-Ebert-Stiftung Bonn.