Encoders

Encoder

class nupic.encoders.base.Encoder

An encoder converts a value to a sparse distributed representation.

This is the base class for encoders that are compatible with the OPF. The OPF requires that values can be represented as a scalar value for use in places like the SDR Classifier.

Note

The Encoder superclass implements:

  • encode() - returns a numpy array encoding the input; syntactic sugar on top of encodeIntoArray. If pprint, prints the encoding to the terminal
  • pprintHeader() - prints a header describing the encoding to the terminal
  • pprint() - prints an encoding to the terminal

Warning

The following methods and properties must be implemented by subclasses:

closenessScores(expValues, actValues, fractional=True)

Compute closeness scores between the expected scalar value(s) and actual scalar value(s). The expected scalar values are typically those obtained from the getScalars() method. The actual scalar values are typically those returned from the topDownCompute() method.

This method returns one closeness score for each value in expValues (or actValues which must be the same length). The closeness score ranges from 0 to 1.0, 1.0 being a perfect match and 0 being the worst possible match.

If this encoder is a simple, single field encoder, then it will expect just 1 item in each of the expValues and actValues arrays. Multi-encoders will expect 1 item per sub-encoder.

Each encoder type can define its own metric for closeness. For example, a category encoder may return either 1 or 0, depending on whether the scalar matches exactly or not. A scalar encoder might return a percentage match, etc.

Parameters:
  • expValues – Array of expected scalar values, typically obtained from getScalars()
  • actValues – Array of actual values, typically obtained from topDownCompute()
Returns:

Array of closeness scores, one per item in expValues (or actValues).
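
As a rough illustration of the two kinds of metric described above, the sketch below scores a category-style field with an exact match and a scalar-style field with a fractional error. The function names and the value_range parameter are hypothetical and are not part of the NuPIC API; each encoder defines its own metric.

```python
def category_closeness(exp_values, act_values):
    """Exact-match metric: 1.0 if the scalars are equal, otherwise 0.0."""
    return [1.0 if e == a else 0.0 for e, a in zip(exp_values, act_values)]


def scalar_closeness(exp_values, act_values, value_range):
    """Fractional metric: 1.0 for a perfect match, falling linearly to 0.0
    once the error reaches value_range (an assumed normalisation)."""
    scores = []
    for e, a in zip(exp_values, act_values):
        error = min(abs(e - a), value_range)
        scores.append(1.0 - error / float(value_range))
    return scores


# One score per expected value, each in the range 0 to 1.0.
print(category_closeness([1], [2]))
print(scalar_closeness([0], [50], 100))
```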

decode(encoded, parentFieldName='')

Takes an encoded output and does its best to work backwards and generate the input that would have generated it.

In cases where the encoded output contains more ON bits than an input would have generated, this routine will return one or more ranges of inputs which, if their encoded outputs were ORed together, would produce the target output. This behavior makes this method suitable for doing things like generating a description of a learned coincidence in the SP, which in many cases might be a union of one or more inputs.

If you instead want to find the single input scalar value most likely to have generated a specific encoded output, use the topDownCompute() method.

If you want to pretty print the return value from this method, use the decodedToStr() method.

Parameters:
  • encoded – The encoded output that you want to decode
  • parentFieldName – The name of the encoder which is our parent. This name is prefixed to each of the field names within this encoder to form the keys of the dict() in the retval.
Returns:

tuple(fieldsDict, fieldOrder) (see below for details)

fieldsDict is a dict() where the keys represent field names (only 1 if this is a simple encoder, > 1 if this is a multi or date encoder) and the values are the result of decoding each field. If there are no bits in encoded that would have been generated by a field, it won’t be present in the dict. The key of each entry in the dict is formed by joining the passed in parentFieldName with the child encoder name using a ‘.’.

Each ‘value’ in fieldsDict consists of (ranges, desc), where ranges is a list of one or more (minVal, maxVal) ranges of input that would generate bits in the encoded output and ‘desc’ is a pretty print description of the ranges. For encoders like the category encoder, the ‘desc’ will contain the category names that correspond to the scalar values included in the ranges.

The fieldOrder is a list of the keys from fieldsDict, in the same order as the fields appear in the encoded output.

TODO: when we switch to Python 2.7 or 3.x, use OrderedDict

Example retvals for a scalar encoder:

{'amount': ( [[1,3], [7,10]], '1-3, 7-10' )}
{'amount': ( [[2.5,2.5]], '2.5' )}

Example retval for a category encoder:

{'country': ( [[1,1], [5,6]], 'US, GB, ES' )}

Example retval for a multi encoder:

{'amount': ( [[2.5,2.5]], '2.5' ),
 'country': ( [[1,1], [5,6]], 'US, GB, ES' )}

decodedToStr(decodeResults)

Return a pretty print string representing the return value from decode().
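
To make the shape of the decode() return value concrete, here is a minimal sketch that walks a (fieldsDict, fieldOrder) tuple and builds a one-line description. The helper name describe_decoded and its output format are assumptions made for illustration; the string produced by decodedToStr() itself may be formatted differently.

```python
def describe_decoded(decode_results):
    """Build a one-line summary from a (fieldsDict, fieldOrder) tuple."""
    fields_dict, field_order = decode_results
    parts = []
    for name in field_order:
        # Each value is (ranges, desc); only the description is printed here.
        ranges, desc = fields_dict[name]
        parts.append("%s:[%s]" % (name, desc))
    return ", ".join(parts)


# Hand-written sample shaped like the multi encoder retval shown earlier.
example = (
    {"amount": ([[2.5, 2.5]], "2.5"),
     "country": ([[1, 1], [5, 6]], "US, GB, ES")},
    ["amount", "country"],
)
print(describe_decoded(example))  # amount:[2.5], country:[US, GB, ES]
```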

encode(inputData)

Convenience wrapper for encodeIntoArray.

This may be less efficient because it allocates a new numpy array every call.

Parameters: inputData – Data to encode, in the same form accepted by encodeIntoArray()
Returns: a numpy array with the encoded representation of inputData
encodeIntoArray(inputData, output)

Encodes inputData and puts the encoded value into the numpy output array, which is a 1-D array whose length is the value returned by getWidth().

Note: The numpy output array is reused, so clear it before updating it.

Parameters:
  • inputData – Data to encode. This should be validated by the encoder.
  • output – numpy 1-D array whose length equals the value returned by getWidth()
encodedBitDescription(bitOffset, formatted=False)

Return a description of the given bit in the encoded output. This will include the field name and the offset within the field.

Parameters:
  • bitOffset – Offset of the bit to get the description of
  • formatted – If True, the bitOffset is w.r.t. formatted output, which includes separators
Returns:

tuple(fieldName, offsetWithinField)

formatBits(inarray, outarray, scale=1, blank=255, leftpad=0)

Copy one array to another, inserting blanks between fields (for display). If leftpad is one, then there is a dummy value at element 0 of the arrays, and counting should start from 1 rather than 0.

Parameters:
  • inarray – TODO: document
  • outarray – TODO: document
  • scale – TODO: document
  • blank – TODO: document
  • leftpad – TODO: document
getBucketIndices(inputData)

Returns an array containing the sub-field bucket indices for each sub-field of the inputData. To get the associated field names for each of the buckets, call getScalarNames().

Parameters: inputData – The data from the source. This is typically an object with members.
Returns: array of bucket indices
getBucketInfo(buckets)

Returns a list of EncoderResult namedtuples describing the inputs for each sub-field that correspond to the bucket indices passed in ‘buckets’. To get the associated field names for each of the values, call getScalarNames().

Parameters: buckets – The list of bucket indices, one for each sub-field encoder. These bucket indices may, for example, have been retrieved from the getBucketIndices() call.
Returns: A list of EncoderResult namedtuples. Each EncoderResult has three attributes:

  • value – The value for the sub-field, in a format that is consistent with the type specified by getDecoderOutputFieldTypes(). Note that this value is not necessarily numeric.
  • scalar – The scalar representation of value. This number is consistent with what is returned by getScalars(). It is always an int or float, and can be used for numeric comparisons.
  • encoding – The encoded bit-array (numpy array) that represents 'value'. That is, if 'value' was passed to encode(), an identical bit-array should be returned.

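
The three attributes of an EncoderResult can be pictured with a small namedtuple. The definition below is a sketch that assumes the attribute order value, scalar, encoding as listed in this section; the sample values are made up, and a plain list stands in for the numpy array.

```python
from collections import namedtuple

# Assumed attribute order, taken from the description in this section.
EncoderResult = namedtuple("EncoderResult", ["value", "scalar", "encoding"])

# Hypothetical result for a category field: the value is a string,
# the scalar is numeric, and the encoding is the bit array.
result = EncoderResult(value="US", scalar=1, encoding=[0, 1, 1, 0])

print(result.value)     # not necessarily numeric
print(result.scalar)    # always an int or float
print(result.encoding)  # bits that encode() would produce for value
```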
getBucketValues()

Returns a list of items, one for each bucket defined by this encoder. Each item is the value assigned to that bucket. This is the same as the EncoderResult.value that would be returned by getBucketInfo() for that bucket, and it is in the same format as the input that would be passed to encode().

This call is faster than calling getBucketInfo() on each bucket individually if all you need are the bucket values.

Must be overridden by subclasses.

Returns: list of items, each item representing the bucket value for that bucket.
getDecoderOutputFieldTypes()

Returns a sequence of field types corresponding to the elements in the decoded output field array. The types are defined by nupic.data.fieldmeta.FieldMetaType.

Returns: list of nupic.data.fieldmeta.FieldMetaType objects
getDescription()

This returns a list of tuples, each containing (name, offset). The ‘name’ is a string description of each sub-field, and offset is the bit offset of the sub-field for that encoder.

For now, only the ‘multi’ and ‘date’ encoders have multiple (name, offset) pairs. All other encoders have a single pair, where the offset is 0.

Must be overridden by subclasses.

Returns: list of tuples containing (name, offset)
getDisplayWidth()

Calculate width of display for bits plus blanks between fields.

Returns: width of display for bits plus blanks between fields
getEncodedValues(inputData)

Returns the input in the same format as is returned by topDownCompute(). For most encoder types, this is the same as the input data. For instance, for scalar and category types, this corresponds to the numeric and string values, respectively, from the inputs. For datetime encoders, this returns the list of scalars for each of the sub-fields (timeOfDay, dayOfWeek, etc.)

This method is essentially the same as getScalars() except that it returns strings.

Parameters: inputData – The input data in the format it is received from the data source
Returns: A list of values, in the same format and in the same order as they are returned by topDownCompute().

getEncoderList()
Returns: a reference to each sub-encoder in this encoder. They are returned in the same order as they are for getScalarNames() and getScalars().
getFieldDescription(fieldName)

Return the offset and length of a given field within the encoded output.

Parameters: fieldName – Name of the field
Returns: tuple(offset, width) of the field within the encoded output
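
One way to derive (offset, width) for a field is from the (name, offset) pairs that getDescription() returns together with the total width from getWidth(). The sketch below assumes the pairs are ordered by offset; the function name and the sample layout are hypothetical.

```python
def field_description(description, total_width, field_name):
    """Return (offset, width) for field_name.

    description is a list of (name, offset) pairs ordered by offset and
    total_width is the full width of the encoded output, in bits.
    """
    for i, (name, offset) in enumerate(description):
        if name == field_name:
            if i + 1 < len(description):
                next_offset = description[i + 1][1]
            else:
                # The last field runs to the end of the output.
                next_offset = total_width
            return offset, next_offset - offset
    raise KeyError(field_name)


# Made-up layout: a 14-bit field followed by a 10-bit field.
layout = [("amount", 0), ("country", 14)]
print(field_description(layout, 24, "country"))  # (14, 10)
```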
getScalarNames(parentFieldName='')

Return the field names for each of the scalar values returned by getScalars.

Parameters: parentFieldName – The name of the encoder which is our parent. This name is prefixed to each of the field names within this encoder to form the keys of the dict() in the retval.
Returns: array of field names
getScalars(inputData)

Returns a numpy array containing the sub-field scalar value(s) for each sub-field of the inputData. To get the associated field names for each of the scalar values, call getScalarNames().

For a simple scalar encoder, the scalar value is simply the input unmodified. For category encoders, it is the scalar representing the category string that is passed in. For the datetime encoder, the scalar value is the number of seconds since epoch.

The intent of the scalar representation of a sub-field is to provide a baseline for measuring error differences. You can compare the scalar value of the inputData with the scalar value returned from topDownCompute() on a top-down representation to evaluate prediction accuracy, for example.

Parameters: inputData – The data from the source. This is typically an object with members
Returns: array of scalar values
getWidth()

Should return the output width, in bits.

Returns: output width in bits
pprint(output, prefix='')

Pretty-print the encoded output using ascii art.

Parameters:
  • output – the encoded output to print
  • prefix – printed before the header if specified
pprintHeader(prefix='')

Pretty-print a header that labels the sub-fields of the encoded output. This can be used in conjunction with pprint().

Parameters: prefix – printed before the header if specified
scalarsToStr(scalarValues, scalarNames=None)

Return a pretty print string representing the return values from getScalars and getScalarNames().

Parameters:
  • scalarValues – input values to encode to string
  • scalarNames – optional input of scalar names to convert. If None, gets scalar names from getScalarNames()
Returns:

string representation of scalar values

setFieldStats(fieldName, fieldStatistics)

This method is called by the model to set the statistics like min and max for the underlying encoders if this information is available.

Parameters:
  • fieldName – name of the field this encoder is encoding, provided by multiencoder
  • fieldStatistics – dictionary of dictionaries, with the first level keyed by field name and the second level keyed by statistic, e.g. fieldStatistics['pounds']['min']
setLearning(learningEnabled)

Set whether learning is enabled.

Parameters: learningEnabled – whether learning should be enabled
setStateLock(lock)

Setting this to true freezes the state of the encoder. This is separate from the learning state, which affects changing parameters. Implemented in subclasses.

topDownCompute(encoded)

Returns a list of EncoderResult namedtuples describing the top-down best guess inputs for each sub-field given the encoded output. These are the values which are most likely to generate the given encoded output. To get the associated field names for each of the values, call getScalarNames().

Parameters: encoded – The encoded output. Typically received from the topDown outputs from the spatial pooler just above us.
Returns: A list of EncoderResult namedtuples. Each EncoderResult has three attributes:

  • value – The best-guess value for the sub-field, in a format that is consistent with the type specified by getDecoderOutputFieldTypes(). Note that this value is not necessarily numeric.
  • scalar – The scalar representation of this best-guess value. This number is consistent with what is returned by getScalars(). It is always an int or float, and can be used for numeric comparisons.
  • encoding – The encoded bit-array (numpy array) that represents the best-guess value. That is, if 'value' was passed to encode(), an identical bit-array should be returned.

Scalar Encoder

class nupic.encoders.scalar.ScalarEncoder(w, minval, maxval, periodic=False, n=0, radius=0, resolution=0, name=None, verbosity=0, clipInput=False, forced=False)

Bases: nupic.encoders.base.Encoder

A scalar encoder encodes a numeric (floating point) value into an array of bits. The output is 0’s except for a contiguous block of 1’s. The location of this contiguous block varies continuously with the input value.

The encoding is linear. If you want a nonlinear encoding, just transform the scalar (e.g. by applying a logarithm function) before encoding. It is not recommended to bin the data as a pre-processing step, e.g. “1” = $0 - $.20, “2” = $.21-$0.80, “3” = $.81-$1.20, etc. as this removes a lot of information and prevents nearby values from overlapping in the output. Instead, use a continuous transformation that scales the data (a piecewise transformation is fine).

Warning

There are three mutually exclusive parameters that determine the overall size of the output. Exactly one of n, radius, resolution must be set. “0” is a special value that means “not set”.

Parameters:
  • w – The number of bits that are set to encode a single value (the “width” of the output signal). Restriction: w must be odd to avoid centering problems.
  • minval – The minimum value of the input signal.
  • maxval – The upper bound of the input signal. (input is strictly less if periodic == True)
  • periodic – If true, then the input value “wraps around” such that minval = maxval. For a periodic value, the input must be strictly less than maxval; otherwise maxval is a true upper bound.
  • n – The number of bits in the output. Must be greater than or equal to w
  • radius – Two inputs separated by more than the radius have non-overlapping representations. Two inputs separated by less than the radius will in general overlap in at least some of their bits. You can think of this as the radius of the input.
  • resolution – Two inputs separated by an amount greater than or equal to the resolution are guaranteed to have different representations.
  • name – an optional string which will become part of the description
  • clipInput – if true, non-periodic inputs smaller than minval or greater than maxval will be clipped to minval/maxval
  • forced – if true, skip some safety checks (for compatibility reasons), default false

Note

Radius and resolution are specified with respect to the input, not output. w is specified with respect to the output.

Example: day of week

w = 3
minval = 1 (Monday)
maxval = 8 (Monday)
periodic = True
n = 14
[equivalently: radius = 1.5 or resolution = 0.5]

The following values would encode midnight – the start of the day

monday (1)   -> 11000000000001
tuesday(2)   -> 01110000000000
wednesday(3) -> 00011100000000
...
sunday (7)   -> 00000000000111

Since the resolution is 12 hours, we can also encode noon, as

monday noon   -> 11100000000000
tuesday midnt -> 01110000000000
tuesday noon  -> 00111000000000
etc.
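
The wrap-around in the day-of-week example can be reproduced with a few lines of plain Python. This is a minimal sketch under the stated parameters (w = 3, minval = 1, maxval = 8, n = 14), not the library implementation; the function name encode_periodic and the exact centre-bit rule are assumptions chosen so that the Monday and Tuesday rows above come out as shown.

```python
def encode_periodic(value, w=3, minval=1, maxval=8, n=14):
    """Return the bit string for a periodic scalar encoding (illustrative)."""
    if not (minval <= value < maxval):
        raise ValueError("periodic input must satisfy minval <= value < maxval")
    half_width = (w - 1) // 2
    # Map the input linearly onto the n output positions.
    center = int((value - minval) * n / float(maxval - minval))
    bits = [0] * n
    for i in range(center - half_width, center + half_width + 1):
        bits[i % n] = 1  # indices past either end wrap around
    return "".join(str(b) for b in bits)


print(encode_periodic(1))    # 11000000000001  (Monday)
print(encode_periodic(2))    # 01110000000000  (Tuesday)
print(encode_periodic(1.5))  # 11100000000000  (Monday noon)
```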

`n` vs `resolution`

It may not be natural to specify “n”, especially with non-periodic data. For example, consider encoding an input with a range of 1-10 (inclusive) using an output width of 5. If you specify resolution = 1, this means that inputs of 1 and 2 have different outputs, though they overlap, but 1 and 1.5 might not have different outputs. This leads to a 14-bit representation like this:

1 ->  11111000000000  (14 bits total)
2 ->  01111100000000
...
10->  00000000011111
[resolution = 1; n=14; radius = 5]
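
For the non-periodic case, a similar sketch reproduces the 14-bit rows above. It assumes the block of w bits is centred on a bucket found by rounding the input to the nearest multiple of the resolution, with (w-1)/2 bits of padding at each end; the function name encode_linear is hypothetical and n is passed in rather than derived.

```python
def encode_linear(value, w=5, minval=1, maxval=10, resolution=1.0, n=14):
    """Return the bit string for a non-periodic scalar encoding (illustrative)."""
    if not (minval <= value <= maxval):
        raise ValueError("input must lie within [minval, maxval]")
    half_width = (w - 1) // 2
    # Round to the nearest bucket, then shift right by the padding.
    center = int((value - minval + resolution / 2.0) / resolution) + half_width
    if center + half_width >= n:
        raise ValueError("n is too small for this input")
    bits = [0] * n
    for i in range(center - half_width, center + half_width + 1):
        bits[i] = 1
    return "".join(str(b) for b in bits)


print(encode_linear(1))   # 11111000000000
print(encode_linear(2))   # 01111100000000
print(encode_linear(10))  # 00000000011111
```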

You could specify resolution = 0.5, which gives

1   -> 11111000... (22 bits total)
1.5 -> 011111.....
2.0 -> 0011111....
[resolution = 0.5; n=22; radius=2.5]

You could specify radius = 1, which gives

1   -> 111110000000....  (50 bits total)
2   -> 000001111100....
3   -> 000000000011111...
...
10  ->                           .....000011111
[radius = 1; resolution = 0.2; n=50]

An N/M encoding can also be used to encode a binary value, where we want more than one bit to represent each state. For example, we could have: w = 5, minval = 0, maxval = 1, radius = 1 (which is equivalent to n=10)

0 -> 1111100000
1 -> 0000011111

Implementation details

range = maxval - minval
h = (w-1)/2  (half-width)
resolution = radius / w
n = w * range/radius (periodic)
n = w * range/radius + 2 * h (non-periodic)
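
The relationships above can be turned into a small helper that derives the remaining sizes from whichever of radius or resolution is given. This sketch follows the formulas exactly as written here; the function name derive_sizes is hypothetical, and the actual implementation may round or adjust the range differently, so results for non-periodic inputs may not match every example in this section.

```python
def derive_sizes(w, minval, maxval, periodic, radius=0, resolution=0):
    """Derive n, radius and resolution from the formulas in this section."""
    if (radius != 0) == (resolution != 0):
        raise ValueError("set exactly one of radius or resolution")
    value_range = float(maxval - minval)
    half_width = (w - 1) // 2
    if radius != 0:
        resolution = float(radius) / w
    else:
        radius = float(resolution) * w
    n = w * value_range / radius
    if not periodic:
        n += 2 * half_width  # padding at both ends
    return {"n": int(round(n)), "radius": radius,
            "resolution": resolution, "halfWidth": half_width}


# Day-of-week example: w = 3, range 1..8, periodic, radius = 1.5
print(derive_sizes(3, 1, 8, True, radius=1.5))
```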
closenessScores(expValues, actValues, fractional=True)

See the function description in base.py

decode(encoded, parentFieldName='')

See the function description in base.py

encodeIntoArray(input, output, learn=True)

See method description in base.py

getBucketIndices(input)

See method description in base.py

getBucketInfo(buckets)

See the function description in base.py

getBucketValues()

See the function description in base.py

getDecoderOutputFieldTypes()

[Encoder class virtual method override]

topDownCompute(encoded)

See the function description in base.py

Random Distributed Scalar Encoder

class nupic.encoders.random_distributed_scalar.RandomDistributedScalarEncoder(resolution, w=21, n=400, name=None, offset=None, seed=42, verbosity=0)

Bases: nupic.encoders.base.Encoder

A scalar encoder encodes a numeric (floating point) value into an array of bits.

This class maps a scalar value into a random distributed representation that is suitable as scalar input into the spatial pooler. The encoding scheme is designed to replace a simple ScalarEncoder. It preserves the important properties around overlapping representations. Unlike ScalarEncoder the min and max range can be dynamically increased without any negative effects. The only required parameter is resolution, which determines the resolution of input values.

Scalar values are mapped to a bucket. The class maintains a random distributed encoding for each bucket. The following properties are maintained by RandomDistributedScalarEncoder:

1) Similar scalars should have high overlap. Overlap should decrease smoothly as scalars become less similar. Specifically, neighboring bucket indices must overlap by a linearly decreasing number of bits.

2) Dissimilar scalars should have very low overlap so that the SP does not confuse representations. Specifically, buckets that are more than w indices apart should have at most maxOverlap bits of overlap. We arbitrarily (and safely) define “very low” to be 2 bits of overlap or lower.

Properties 1 and 2 lead to the following overlap rules for buckets i and j:

If abs(i-j) < w then:
    overlap(i,j) = w - abs(i-j)
else:
    overlap(i,j) <= maxOverlap

3) The representation for a scalar must not change during the lifetime of the object. Specifically, as new buckets are created and the min/max range is extended, the representation for previously in-range scalars and previously created buckets must not change.
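
The overlap rules for buckets i and j can be written as a small checker. The sketch below is illustrative only: the helper names are hypothetical, max_overlap defaults to the 2 bits mentioned in property 2, and the toy encoding uses contiguous blocks of bits rather than the random bits the encoder actually maintains.

```python
def required_overlap(i, j, w):
    """Exact overlap demanded for nearby buckets, or None when only the
    maxOverlap upper bound applies."""
    distance = abs(i - j)
    return w - distance if distance < w else None


def overlap_is_valid(i, j, bits_i, bits_j, w, max_overlap=2):
    """Check the ON-bit indices of buckets i and j against the rules."""
    actual = len(set(bits_i) & set(bits_j))
    required = required_overlap(i, j, w)
    if required is None:
        return actual <= max_overlap
    return actual == required


def toy_bits(index, w):
    """Toy encoding: bucket k turns on bits k .. k+w-1."""
    return list(range(index, index + w))


w = 5
print(overlap_is_valid(0, 2, toy_bits(0, w), toy_bits(2, w), w))  # True
print(overlap_is_valid(0, 7, toy_bits(0, w), toy_bits(7, w), w))  # True
```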

encodeIntoArray(x, output)

See method description in base.py

getBucketIndices(x)

See method description in base.py

getDecoderOutputFieldTypes()

See method description in base.py

getWidth()

See method description in base.py

mapBucketIndexToNonZeroBits(index)

Given a bucket index, return the list of non-zero bits. If the bucket index does not exist, it is created. If the index falls outside our range we clip it.

Parameters: index – The bucket index to get non-zero bits for.
Returns: numpy array of indices of non-zero bits for the specified index.

DateEncoder

class nupic.encoders.date.DateEncoder(season=0, dayOfWeek=0, weekend=0, holiday=0, timeOfDay=0, customDays=0, name='', forced=True)

Bases: nupic.encoders.base.Encoder

A date encoder encodes a date according to encoding parameters specified in its constructor. The input to a date encoder is a datetime.datetime object. The output is the concatenation of several sub-encodings, each of which encodes a different aspect of the date. Which sub-encodings are present, and details of those sub-encodings, are specified in the DateEncoder constructor.

Each parameter describes one attribute to encode. By default, the attribute is not encoded.

season (season of the year; units = day):
  • (int) width of attribute; default radius = 91.5 days (1 season)
  • (tuple) season[0] = width; season[1] = radius
dayOfWeek (monday = 0; units = day):
  • (int) width of attribute; default radius = 1 day
  • (tuple) dayOfWeek[0] = width; dayOfWeek[1] = radius
weekend (boolean: 0, 1):
  • (int) width of attribute
holiday (boolean: 0, 1):
  • (int) width of attribute
timeOfDay (midnight = 0; units = hour):
  • (int) width of attribute; default radius = 4 hours
  • (tuple) timeOfDay[0] = width; timeOfDay[1] = radius

customDays TODO: what is it?

forced (default True) : if True, skip checks for parameters’ settings; see encoders/scalar.py for details

encodeIntoArray(input, output)

See method description in base.py

getBucketIndices(input)

See method description in base.py

getEncodedValues(input)

See method description in base.py

getScalarNames(parentFieldName='')

See method description in base.py

getScalars(input)

See method description in base.py

Parameters: input – A datetime object representing the time being encoded
Returns: A numpy array of the corresponding scalar values in the following order:

[season, dayOfWeek, weekend, holiday, timeOfDay]

Note: some of these fields might be omitted if they were not specified in the encoder
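
The ordering and the omission of disabled fields can be sketched with the standard datetime module. Everything here is an assumption made for illustration: the function name, the zero-based day of year used for season, and treating only Saturday and Sunday as the weekend. Holiday is left out because it depends on a holiday list, and a plain list stands in for the numpy array.

```python
import datetime


def date_scalars(dt, season=True, dayOfWeek=True, weekend=True, timeOfDay=True):
    """Return scalars for the enabled fields, in the documented order."""
    scalars = []
    if season:
        # Day of the year, starting at 0 (assumption).
        scalars.append(float(dt.timetuple().tm_yday - 1))
    if dayOfWeek:
        scalars.append(float(dt.weekday()))  # monday = 0
    if weekend:
        # Assumption: only Saturday and Sunday count as the weekend.
        scalars.append(1.0 if dt.weekday() >= 5 else 0.0)
    if timeOfDay:
        scalars.append(dt.hour + dt.minute / 60.0)  # midnight = 0, in hours
    return scalars


monday_noon = datetime.datetime(2024, 1, 1, 12, 30)
print(date_scalars(monday_noon))
print(date_scalars(monday_noon, season=False, weekend=False))
```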

MultiEncoder

class nupic.encoders.multi.MultiEncoder(encoderDescriptions=None)

Bases: nupic.encoders.base.Encoder

A MultiEncoder encodes a dictionary or object with multiple components. A MultiEncoder contains a number of sub-encoders, each of which encodes a separate component.

addMultipleEncoders(fieldEncodings)
Parameters: fieldEncodings – a dict of dicts, mapping field names to the field params dict.

Each field params dict has the following keys:

  1) the data field name that matches the key ('fieldname')
  2) an encoder type ('type')
  3) the encoder params (all other keys)

For example,

fieldEncodings = {
    'dateTime': dict(fieldname='dateTime', type='DateEncoder',
                     timeOfDay=(5,5)),
    'attendeeCount': dict(fieldname='attendeeCount', type='ScalarEncoder',
                          name='attendeeCount', minval=0, maxval=250,
                          clipInput=True, w=5, resolution=10),
    'consumption': dict(fieldname='consumption', type='ScalarEncoder',
                        name='consumption', minval=0, maxval=110,
                        clipInput=True, w=5, resolution=5),
}

would yield a vector with one part encoded by the DateEncoder and two parts handled separately by ScalarEncoders with the specified parameters. The three separate encodings are then concatenated into the final vector, in such a way that each always occupies the same location within the vector.
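
The fixed placement of each sub-encoding can be pictured as a running offset over the sub-encoder widths. The sketch below is not the MultiEncoder implementation; the helper names and the widths used in the example are made up for illustration.

```python
def layout_fields(widths):
    """Compute (name, offset) pairs and the total width.

    widths is an ordered list of (fieldName, width) pairs.
    """
    description = []
    offset = 0
    for name, width in widths:
        description.append((name, offset))
        offset += width
    return description, offset


def concatenate(encodings):
    """Join the sub-encodings, in order, into one flat bit list."""
    merged = []
    for bits in encodings:
        merged.extend(bits)
    return merged


# Hypothetical widths for the three fields in the example.
description, total = layout_fields(
    [("dateTime", 6), ("attendeeCount", 30), ("consumption", 27)])
print(description)
print(total)
```

Because the offsets depend only on the order and widths of the sub-encoders, each field lands in the same slice of the output on every call.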

getWidth()

Represents the sum of the widths of each field's encoding.

Data

FieldMetaType

class nupic.data.fieldmeta.FieldMetaType

Public values for the field data types

_ALL = ('string', 'datetime', 'int', 'float', 'bool', 'list', 'sdr')
boolean = 'bool'
datetime = 'datetime'
float = 'float'
integer = 'int'
classmethod isValid(fieldDataType)

Check whether a candidate value is one of the valid field data types.

Parameters: fieldDataType (str) – candidate field data type
Returns: True if the candidate value is a legitimate field data type value; False if not
Return type: bool
list = 'list'
sdr = 'sdr'
string = 'string'