Dates must be specified in the data section using the string representation specified in the attribute declaration. String attributes are very useful in text-mining applications, as we can create datasets with string attributes and then write Weka Filters to manipulate the strings (like StringToWordVectorFilter). For example, if an attribute is the third one declared, then Weka expects that all of that attribute's values will be found in the third comma-delimited column.
ARFF files have two distinct sections. Attribute values for each instance are delimited by commas. Compared with the ARFF format described by Witten and Eibe Frank, the new additions are string attributes, date attributes, and sparse instances.
The default format string accepts the ISO 8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss. This explanation was cobbled together by Gordon Paynter.
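As a rough illustration of the date handling described above, the sketch below parses a date value from the data section in Python. It assumes the default pattern is the ISO 8601 combined form (yyyy-MM-dd'T'HH:mm:ss in Java notation) and uses the equivalent strptime pattern; the helper name parse_arff_date is hypothetical and is not part of Weka.

```python
from datetime import datetime

# Python equivalent of the Java pattern yyyy-MM-dd'T'HH:mm:ss (assumed default).
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_arff_date(text, fmt=ISO_FORMAT):
    """Parse one date field from an ARFF data row (hypothetical helper).

    Surrounding whitespace and double quotes are stripped before parsing.
    Raises ValueError if the text does not match the declared format.
    """
    return datetime.strptime(text.strip().strip('"'), fmt)


stamp = parse_arff_date('"2001-04-03T12:12:12"')
```

A value that does not match the declared format raises ValueError, which mirrors the requirement that dates follow the string representation given in the attribute declaration.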
Attribute declarations take the form of an ordered sequence of attribute statements. If spaces are to be included in the name, then the entire name must be quoted.
It has been edited by Richard Kirkby. In Weka, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient).
Attribute-Relation File Format (ARFF)
In sparse format, instead of representing each value in order, only the non-zero values are listed. In the ordinary format, attribute values must appear in the order that they were declared in the header section. Values of string and nominal attributes are case sensitive, and any that contain spaces must be quoted. The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types.
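To make the positional rule concrete, here is a minimal Python sketch that maps one comma-delimited data row onto attribute names by declaration order. The function name parse_data_line is hypothetical, only double-quoted values are handled, and a field consisting of a question mark is treated as a missing value.

```python
import csv


def parse_data_line(line, attribute_names):
    """Map one ARFF data row to {attribute name: value} by column position.

    A field of "?" is returned as None (missing). Single-quoted values are
    not handled in this sketch.
    """
    fields = next(csv.reader([line], skipinitialspace=True))
    if len(fields) != len(attribute_names):
        raise ValueError("row has %d fields, expected %d"
                         % (len(fields), len(attribute_names)))
    return {name: (None if field == "?" else field)
            for name, field in zip(attribute_names, fields)}


row = parse_data_line('5.1, ?, "class A"', ["width", "height", "label"])
```

The third declared attribute (label here) receives the third column, and values are kept as case-sensitive strings.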
The standard IRIS dataset is a common example: its header names the relation, four numeric attributes, and a nominal class attribute.
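The following Python sketch assembles a header of that shape. The attribute names follow the iris.arff file commonly distributed with Weka; the helper names quote and arff_header are hypothetical, and the quoting rule (double quotes around any name containing a space, with no escaping) is a simplifying assumption.

```python
def quote(name):
    """Quote a name if it contains a space (simplified: no escaping)."""
    return '"%s"' % name if " " in name else name


def arff_header(relation, attributes):
    """Build ARFF header text.

    attributes is an ordered list of (name, type) pairs, where type is a
    keyword such as "numeric" or "string", or a list of nominal values.
    """
    lines = ["@relation " + quote(relation), ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(quote(v) for v in kind) + "}"
        lines.append("@attribute %s %s" % (quote(name), kind))
    return "\n".join(lines)


header = arff_header("iris", [
    ("sepallength", "numeric"),
    ("sepalwidth", "numeric"),
    ("petallength", "numeric"),
    ("petalwidth", "numeric"),
    ("class", ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]),
])
```

Because the attribute list is ordered, the position of each pair in the list is also its column position in the data section.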
There is a known problem saving SparseInstance objects from datasets that have string attributes. The order in which the attributes are declared indicates the column position in the data section of the file. The keywords numeric, string and date are case insensitive.
Each attribute in the data set has its own attribute statement, which uniquely defines the name of that attribute and its data type. To get around this problem, add a dummy string value at index 0 that is never used whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.
If a value is unknown, you must explicitly represent it with a question mark (?). Each instance is surrounded by curly braces, and the format for each entry is: index, space, value, where index is the attribute index (starting from 0). The relation name is defined as the first line in the ARFF file.
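The sparse entry format described above can be sketched in Python as follows. The function names format_value and to_sparse are hypothetical; the sketch assumes that only numeric zeros are omitted, that None stands for an unknown value, and that values containing a space are wrapped in double quotes.

```python
def format_value(value):
    """Render one value: None becomes ?, values with spaces are quoted."""
    if value is None:
        return "?"
    text = str(value)
    return '"%s"' % text if " " in text else text


def to_sparse(values):
    """Convert a dense row to sparse ARFF text: {index value, ...}.

    Zero values are omitted; unknown values (None) are kept as "?"
    because an omitted entry means 0, not missing.
    """
    entries = ["%d %s" % (i, format_value(v))
               for i, v in enumerate(values)
               if v is None or v != 0]
    return "{" + ", ".join(entries) + "}"


sparse_row = to_sparse([0, "X", 0, "Y", "class A"])
```

Note that an unknown value is still written explicitly, since leaving it out would make it read back as 0.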
However, the first string value is assigned index 0, so internally this value is stored as a 0. The relation name is a string, and it must be quoted if the name includes spaces. The data declaration is a single line denoting the start of the data segment in the file.
String attributes allow us to create attributes containing arbitrary textual values. Missing values are represented by a single question mark. The class value of the Iris dataset, for example, can be defined as a nominal attribute that lists its possible values: {Iris-setosa,Iris-versicolor,Iris-virginica}. String attributes are declared using the string keyword. The first section is the Header information, which is followed by the Data information.
When a SparseInstance is written, string values with internal value 0 are not output, so their string value is lost. When the ARFF file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change.
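The effect can be reproduced with a toy model in Python. This is not Weka code: encode, write_sparse and read_sparse are hypothetical functions that imitate a single string attribute whose values are stored as indexes into a table, with index 0 omitted on writing and used as the default on reading.

```python
def encode(values, dummy=None):
    """Store strings as indexes into a table; optionally reserve index 0."""
    table = [] if dummy is None else [dummy]
    indexes = []
    for value in values:
        if value not in table:
            table.append(value)
        indexes.append(table.index(value))
    return table, indexes


def write_sparse(table, indexes):
    """Sparse writing: an internal value of 0 is not output (None here)."""
    return [None if i == 0 else table[i] for i in indexes]


def read_sparse(rows):
    """Sparse reading: omitted entries default to index 0 of the new table."""
    table = []
    for text in rows:
        if text is not None and text not in table:
            table.append(text)
    default = table[0] if table else None
    return [default if text is None else text for text in rows]


original = ["alpha", "beta", "alpha"]
lossy = read_sparse(write_sparse(*encode(original)))
safe = read_sparse(write_sparse(*encode(original, dummy="dummy")))
```

In this model, lossy comes back with "alpha" replaced by "beta", while safe matches the original list because the dummy value occupies index 0 and is never used by any instance.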
Date attribute declarations take the form: