Class CifDataParser
- All Implemented Interfaces:
GenericCifDataParser
- Direct Known Subclasses:
Cif2DataParser
Regarding the treatment of single quotes vs. primes in a CIF file, PMR wrote:
* There is a formal grammar for CIF (see http://www.iucr.org/iucr-top/cif/index.html) which confirms this. The textual explanation is:
14. Matching single or double quote characters (' or ") may be used to bound a string representing a non-simple data value provided the string does not extend over more than one line.
15. Because data values are invariably separated from other
tokens in the file by white space, such a quote-delimited
character string may contain instances of the character used
to delimit the string provided they are not followed by white
space. For example, the data item
_example 'a dog's life'
is legal; the data value is a dog's life.
[PMR - the terminating character(s) are quote+whitespace.
That would mean that:
_example 'Jones' life'
would be an error.]
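The quoting rule above can be sketched in a few lines of Java. This is a minimal illustration, not the parser's actual code: the class name `CifQuoteSketch` and the method `readQuoted` are hypothetical, and the sketch assumes the input string starts at the opening quote.

```java
class CifQuoteSketch {
    /**
     * Returns the value of a quote-delimited string that starts at index 0.
     * The closing quote is the first matching quote character that is
     * followed by white space or by the end of the line (rule 15).
     */
    static String readQuoted(String s) {
        char q = s.charAt(0);
        if (q != '\'' && q != '"') {
            throw new IllegalArgumentException("not a quoted value: " + s);
        }
        for (int i = 1; i < s.length(); i++) {
            if (s.charAt(i) != q) {
                continue;
            }
            boolean atEnd = (i + 1 == s.length());
            if (atEnd || Character.isWhitespace(s.charAt(i + 1))) {
                return s.substring(1, i);
            }
            // An embedded quote not followed by white space is part of the value.
        }
        throw new IllegalArgumentException("unterminated quoted value: " + s);
    }

    public static void main(String[] args) {
        System.out.println(readQuoted("'a dog's life'")); // a dog's life
        System.out.println(readQuoted("'Jones' life'"));  // Jones
    }
}
```

In the second call the value ends at `'Jones'` because that quote is followed by white space; the leftover `life'` is a stray token, which is why the data item is an error.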
The CIF format was developed in the late 1980s under the aegis of the International Union of Crystallography (I am a consultant to the COMCIFS committee). It was ratified by the Union and there have been several workshops. mmCIF is an extension of CIF which includes a relational structure. The formal publications are:
Hall, S. R. (1991). "The STAR File: A New Format for Electronic Data Transfer and Archiving", J. Chem. Inf. Comput. Sci., 31, 326-333.
Hall, S. R., Allen, F. H. and Brown, I. D. (1991). "The Crystallographic Information File (CIF): A New Standard Archive File for Crystallography", Acta Cryst., A47, 655-685.
Hall, S. R. and Spadaccini, N. (1994). "The STAR File: Detailed Specifications", J. Chem. Inf. Comput. Sci., 34, 505-508.
-
Field Summary
Fields (Modifier and Type, Field, Description)
protected boolean asObject - A flag to create and return Java objects, not strings.
protected int cch - length of str
protected int columnCount
protected String[] columnNames
protected char cterm - optional token terminator; in CIF 2.0 could be } or ]
protected boolean debugging - debugging flag passed from reader; unused
protected boolean haveData
htFields - A global, static map that contains field information.
protected int ich - pointer to current character on str
static final int KEY_MAX - The maximum number of columns (data keys) passed to the parser or found in the file for a given loop_ or category.subkey listing.
protected String line - from buffered reader
static String nullString - string to return for CIF data value .
protected String str - working string (buffer)
protected boolean wasUnquoted - whether we are processing an unquoted value or key
Fields inherited from interface GenericCifDataParser
EMPTY, NONE -
Constructor Summary
Constructors
CifDataParser() -
Method Summary
Methods (Modifier and Type, Method, Description)
fixKey - Switch "." to "_" in a key, and also make any quirky key name changes that need to be done
getAllCifData - Parses all CIF data for a reader defined in the constructor into a standard Map structure and closes the BufferedReader if it exists.
getAllCifDataType(String... types)
int getColumnCount()
getColumnData(int i) - Column i in the current row
getColumnName(int i)
boolean getData() - The workhorse; a general reader for loop data.
getFileHeader
getNextDataToken - first checks to see if the next token is an unquoted control code, and if so, returns null
getNextToken - Get a token as a String value (for the reader)
getNextTokenObject - Get the token as a Java Object
protected Object getNextTokenProtected - Just makes sure
protected Object getQuotedStringOrObject(char ch) - CIF 1.0 only.
getTokenPeeked
protected int getVersion()
protected boolean isQuote(char ch) - CIF 1.0 only; we handle various quote types here
protected boolean isTerminator(char c) - The token terminator is space or tab in CIF 1.0, but it can be quoted strings in CIF 2.0.
void parseDataBlockParameters(String[] fields, String key, String data, int[] key2col, int[] col2key) - Process a data block, with or without a loop_.
peekToken - Just look at the next token.
protected String prepareNextLine - sets the string for parsing to be from the next line when the token buffer is empty, and if ';' is at the beginning of that line, extends the string to include that full multiline string.
protected String preprocessSemiString - Encapsulate a multi-line ; ....
protected String preprocessString - Preprocess the string on a line starting with a semicolon to produce a string with a \1 ...
readLine()
readList() - Read a CIF 2.0 list structure, converting it to either a JSON string or to a Java data structure
set(GenericLineReader reader, BufferedReader br, boolean debugging) - A Crystallographic Information File data parser.
void setNullValue(String nullString) - Set the string value of what is returned for "." and "?"
protected String setString - sets global str and line to be parsed from the beginning \1 ....
skipLoop(boolean doReport) - Skips all associated loop data.
skipNextToken
toUnicode - Only translating the basic Greek set here, not all the other stuff.
protected Object unquoted - In CIF 2.0, this method turns a String into an Integer or Float. In CIF 1.0 (here) just return the unchanged value.
-
Field Details
-
KEY_MAX
public static final int KEY_MAX
The maximum number of columns (data keys) passed to the parser or found in the file for a given loop_ or category.subkey listing.
-
line
from buffered reader -
str
working string (buffer) -
ich
protected int ich
pointer to current character on str -
cch
protected int cch
length of str -
wasUnquoted
protected boolean wasUnquoted
whether we are processing an unquoted value or key -
cterm
protected char cterm
optional token terminator; in CIF 2.0 could be } or ] -
nullString
string to return for the CIF data values "." and "?"; for the CIF2 reader, "." -
asObject
protected boolean asObject
A flag to create and return Java objects, not strings. Used only by Jmol scripting x = getProperty("cifInfo", filename). -
debugging
protected boolean debugging
debugging flag passed from reader; unused -
columnCount
protected int columnCount -
columnNames
-
haveData
protected boolean haveData -
htFields
-
-
Constructor Details
-
CifDataParser
public CifDataParser()
-
-
Method Details
-
getVersion
protected int getVersion() -
setNullValue
Set the string value of what is returned for "." and "?"
- Parameters:
nullString - null here returns "." and "?"; default is "\0"
-
getColumnData
Column i in the current row
- Specified by:
getColumnData in interface GenericCifDataParser
-
getColumnCount
public int getColumnCount()
- Specified by:
getColumnCount in interface GenericCifDataParser
-
getColumnName
- Specified by:
getColumnName in interface GenericCifDataParser
-
set
A Crystallographic Information File data parser. set() should be called immediately upon construction. Exactly one of reader or br should be non-null; if both are supplied, reader is ignored.
- Specified by:
set in interface GenericCifDataParser
- Parameters:
reader - anything that can deliver a line of text, or null
br - a standard BufferedReader
debugging - debugging flag passed from reader; unused
-
getFileHeader
- Specified by:
getFileHeader in interface GenericCifDataParser
- Returns:
- commented-out section at the start of a CIF file.
-
getAllCifData
Parses all CIF data for a reader defined in the constructor into a standard Map structure and closes the BufferedReader if it exists.
- Specified by:
getAllCifData in interface GenericCifDataParser
- Returns:
- Hashtable of models Vector of Hashtable data
-
getAllCifDataType
- Specified by:
getAllCifDataType in interface GenericCifDataParser
-
readLine
- Specified by:
readLine in interface GenericCifDataParser
-
getData
The workhorse; a general reader for loop data. Fills columnData with fieldCount fields.
- Specified by:
getData in interface GenericCifDataParser
- Returns:
- false if EOF
- Throws:
Exception
-
skipLoop
Skips all associated loop data. (Skips to next control word.)
- Specified by:
skipLoop in interface GenericCifDataParser
- Throws:
Exception
-
getNextToken
Get a token as a String value (for the reader)
- Specified by:
getNextToken in interface GenericCifDataParser
- Returns:
- the next token of any kind, or null
- Throws:
Exception
-
getNextTokenObject
-
getNextTokenProtected
-
getNextDataToken
First checks to see if the next token is an unquoted control code, and if so, returns null
- Specified by:
getNextDataToken in interface GenericCifDataParser
- Returns:
- next data token or null
- Throws:
Exception
-
peekToken
Just look at the next token. Saves it for retrieval using getTokenPeeked()
- Specified by:
peekToken in interface GenericCifDataParser
- Returns:
- next token or null if EOF
- Throws:
Exception
-
getTokenPeeked
- Specified by:
getTokenPeeked in interface GenericCifDataParser
- Returns:
- the token last acquired; may be null
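
The peek-then-retrieve pattern described for peekToken and getTokenPeeked might look like the sketch below. It is an assumption-laden illustration over a plain list of tokens: the class `PeekTokenSketch` is hypothetical, and whether the real getNextToken returns the peeked token again is assumed here, not confirmed by this page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PeekTokenSketch {
    private final Iterator<String> source;
    private String peeked;       // token saved by the last peekToken() call
    private boolean havePeeked;  // true while the saved token is unconsumed

    PeekTokenSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    private String read() {
        return source.hasNext() ? source.next() : null; // null signals EOF
    }

    /** Looks at the next token without consuming it; null at EOF. */
    String peekToken() {
        if (!havePeeked) {
            peeked = read();
            havePeeked = true;
        }
        return peeked;
    }

    /** Returns the token last acquired by peekToken(); may be null. */
    String getTokenPeeked() {
        return peeked;
    }

    /** Consumes the saved token if there is one, otherwise reads a new one. */
    String getNextToken() {
        if (havePeeked) {
            havePeeked = false;
            return peeked;
        }
        return read();
    }

    public static void main(String[] args) {
        PeekTokenSketch p = new PeekTokenSketch(Arrays.asList("loop_", "_cell_length_a"));
        System.out.println(p.peekToken());
        System.out.println(p.getTokenPeeked());
        System.out.println(p.getNextToken());
        System.out.println(p.getNextToken());
    }
}
```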
-
toUnicode
Only translating the basic Greek set here, not all the other stuff. See http://www.iucr.org/resources/cif/spec/version1.1/semantics#markup
- Specified by:
toUnicode in interface GenericCifDataParser
- Parameters:
data
- Returns:
- cleaned string
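
As a rough illustration of the kind of translation toUnicode performs, the sketch below maps a few CIF markup codes (a backslash followed by a Latin letter) to Greek letters. Only four codes are handled, and the class `GreekMarkupSketch` and its tables are hypothetical; the real method covers more of the Greek set.

```java
class GreekMarkupSketch {
    // Subset of the CIF 1.1 markup codes: \a, \b, \g, \d.
    static final String LATIN = "abgd";
    // Corresponding Greek letters: alpha, beta, gamma, delta.
    static final String GREEK = "\u03B1\u03B2\u03B3\u03B4";

    /** Replaces each recognized backslash code with its Greek letter. */
    static String toUnicode(String data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '\\' && i + 1 < data.length()) {
                int k = LATIN.indexOf(data.charAt(i + 1));
                if (k >= 0) {
                    sb.append(GREEK.charAt(k));
                    i++; // skip the letter that followed the backslash
                    continue;
                }
            }
            sb.append(c); // anything else is copied unchanged
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("\\a-quartz"));
    }
}
```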
-
parseDataBlockParameters
public void parseDataBlockParameters(String[] fields, String key, String data, int[] key2col, int[] col2key) throws Exception
Process a data block, with or without a loop_.
Passed an array of field names, this method fills two int[] arrays. The first, key2col, maps desired key values to actual order of appearance (column number) in the file; the second, col2key, is a reverse lookup for that, mapping column numbers to desired field indices. When called within a loop_ context, this.columnData will be created but not filled.
Alternatively, if fields is null, then this.fieldNames is filled, in order, with key data, and both key2col and col2key will be simply 0,1,2,... This array is used in cases such as matrices for which there are simply too many possibilities to list, and the key name itself contains information that we need.
When not in a loop_ context, keys are expected to be in the mmCIF form category.subkey and will be unique within a data block (see http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html). Keys and data will be read for all data in the same category, filling this.columnData. In this way, the calling class does not need to enumerate all possible category names, but instead can focus on just those of interest.
- Specified by:
parseDataBlockParameters in interface GenericCifDataParser
- Parameters:
fields - list of normalized field names, such as "_pdbx_struct_assembly_gen_assembly_id" (with "_" instead of ".")
key - null to indicate a loop_ construct, otherwise the initial category.subkey found
data - when not loop_ the initial data read, otherwise ignored
key2col - map of desired keys to actual columns
col2key - map of actual columns to desired keys
- Throws:
Exception
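
The relationship between key2col and col2key can be illustrated with a small standalone sketch. It is not the parser's implementation: the class `ColumnMapSketch`, the method `mapColumns`, and the use of -1 as the "not present" value (standing in for the interface constant NONE, whose value is not shown on this page) are all assumptions.

```java
import java.util.Arrays;

class ColumnMapSketch {
    static final int NONE = -1; // assumed sentinel for "no match"

    /**
     * Fills key2col (desired field index -> file column) and
     * col2key (file column -> desired field index).
     */
    static void mapColumns(String[] fields, String[] columnNames,
                           int[] key2col, int[] col2key) {
        Arrays.fill(key2col, NONE);
        Arrays.fill(col2key, NONE);
        for (int col = 0; col < columnNames.length; col++) {
            for (int key = 0; key < fields.length; key++) {
                if (fields[key].equals(columnNames[col])) {
                    key2col[key] = col;
                    col2key[col] = key;
                    break;
                }
            }
        }
    }

    public static void main(String[] args) {
        String[] fields = { "_atom_site_label", "_atom_site_fract_x" };
        String[] columns = { "_atom_site_fract_x", "_atom_site_occupancy", "_atom_site_label" };
        int[] key2col = new int[fields.length];
        int[] col2key = new int[columns.length];
        mapColumns(fields, columns, key2col, col2key);
        System.out.println(Arrays.toString(key2col)); // [2, 0]
        System.out.println(Arrays.toString(col2key)); // [1, -1, 0]
    }
}
```

With this layout a caller can loop over file columns, look up col2key for each, and skip any column whose entry is the sentinel.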
-
fixKey
Switch "." to "_" in a key, and also make any quirky key name changes that need to be done
- Specified by:
fixKey in interface GenericCifDataParser
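
The basic substitution that fixKey describes can be sketched as follows. Only the "." to "_" step is shown; the "quirky" key name changes are not reproduced because this page does not list them, and the class `FixKeySketch` is hypothetical.

```java
class FixKeySketch {
    /** Normalizes an mmCIF category.subkey name by replacing "." with "_". */
    static String fixKey(String key) {
        return key.replace('.', '_');
    }

    public static void main(String[] args) {
        System.out.println(fixKey("_pdbx_struct_assembly_gen.assembly_id"));
    }
}
```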
-
setString
-
prepareNextLine
Sets the string for parsing to be from the next line when the token buffer is empty, and if ';' is at the beginning of that line, extends the string to include that full multiline string. Uses \1 to indicate that this is a special quotation.
- Returns:
- the next line or null if EOF
- Throws:
Exception
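
The multi-line handling described for prepareNextLine could be sketched as below. This is a simplified stand-in rather than the actual method: the class `SemiStringSketch` and the method `nextLogicalLine` are hypothetical, line breaks inside the block are kept as "\n", and any text after the closing ';' on its line is dropped in this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class SemiStringSketch {
    /**
     * Returns the next line, or, when the line starts with ';', the whole
     * semicolon-delimited block wrapped in \1 ... \1. Returns null at EOF.
     */
    static String nextLogicalLine(BufferedReader br) throws IOException {
        String line = br.readLine();
        if (line == null || !line.startsWith(";")) {
            return line;
        }
        // Opening ';' found: collect lines until the closing ';' line.
        StringBuilder sb = new StringBuilder("\1").append(line.substring(1));
        while ((line = br.readLine()) != null && !line.startsWith(";")) {
            sb.append('\n').append(line);
        }
        return sb.append('\1').toString();
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(
                new StringReader(";first\nsecond\n;\n_next 1"));
        String block = nextLogicalLine(br);
        System.out.println(block.length());      // 14: two \1 markers plus "first\nsecond"
        System.out.println(nextLogicalLine(br)); // _next 1
    }
}
```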
-
preprocessString
-
preprocessSemiString
-
unquoted
-
isTerminator
protected boolean isTerminator(char c)
The token terminator is space or tab in CIF 1.0, but it can be quoted strings in CIF 2.0.
- Parameters:
c
- Returns:
- true if this character is a terminator
-
isQuote
protected boolean isQuote(char ch)
CIF 1.0 only; we handle various quote types here
- Parameters:
ch
- Returns:
- true if this character is a (starting) quote
-
getQuotedStringOrObject
CIF 1.0 only.
- Parameters:
ch - current character being pointed to
- Returns:
- a String data object
-
readList
-
skipNextToken
- Specified by:
skipNextToken in interface GenericCifDataParser
- Throws:
Exception
-