Frequently Asked Questions
These questions are culled from our public forum and support team. If you have a question to contribute (or better still, a question and answer) please post it on the OmniSci Community Forum.
- Architecture
- Troubleshooting
- Why do I keep running out of memory for GPU queries?
- Why do I keep running out of memory for rendering?
- How can I confirm that OmniSci is actually running on GPUs?
- How do I compare the performance on GPUs vs. CPUs to demonstrate the performance gain of GPUs?
- Does OmniSci support a single server with different GPUs?
- What is watchdog, and when should it heel?
- What do you mean by binning numbers?
- Why am I seeing the error "Distributed support is disabled"?
- How can I avoid creating duplicate rows?
- Why am I seeing the error "Too many open files…errno 24."?
- When I try to ingest a file using pandas.read_csv and pymapd.load_table, I get a TypeError. Why?
- Integrations and Connectors
- Licensing
- Tips and Tricks
Architecture
In Immerse, which charts are rendered server-side (using GPUs) versus in the browser (CPUs)?
Choropleth, Geo Heatmap, Linemap, Pointmap, and Scatter Plot are all rendered on the GPU by OmniSci Render; the rest are rendered in the browser.
Is OmniSci backward compatible?
OmniSci is not backward compatible. With every release, new efficiencies are introduced that are not necessarily compatible with the previous version. As with any database, you should always back up your data before migrating to a later version, just in case you have to revert for any reason.
Troubleshooting
Why do I keep running out of memory for GPU queries?
This typically occurs when the system cannot keep the entire working set of columns in GPU memory.
OmniSci provides two options when your system does not have enough GPU memory available to meet the requirements for executing a query.
The first option is to turn off watchdog (`--enable-watchdog=false`). That allows the query to run in stages on the GPU: OmniSci orchestrates the transfer of data through layers of abstraction and onto the GPU for execution. See Advanced Configuration Flags for OmniSci Server.
The second option is to set `--allow-cpu-retry`. If a query does not fit in GPU memory, it falls back and executes on the CPU. See Configuration Flags for OmniSci Server.
OmniSci is an in-memory database. If your common use case exhausts the capabilities of the VRAM on the available GPUs, try re-estimating the scale of the implementation required to meet your needs. OmniSci can scale across multiple GPUs in a single machine: up to 20 physical GPUs (the most OmniSci has found in one machine), and up to 64 using GPU visualization tools such as Bitfusion Flex. OmniSci can scale across multiple machines in a distributed model, allowing for many servers, each with many cards. The operational data size limit is very flexible.
Why do I keep running out of memory for rendering?
This typically occurs when there is not enough OpenGL memory to render the query results.
Review your omnisci_server.INFO log and check whether you are exceeding GPU memory; such events appear as EVICTION messages.
You might need to increase the amount of buffer space for rendering using the `--render-mem-bytes` configuration flag. Try setting it to 1000000000; if that does not work, increase it to 2000000000.
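If you run omnisci_server with a configuration file rather than command-line flags, the same setting can go there. A minimal fragment (the value is a starting point, not a recommendation):

```
render-mem-bytes = 1000000000
```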
How can I confirm that OmniSci is actually running on GPUs?
The easiest way to check is with the `omnisql` command-line client, which you can find at `$OMNISCI_PATH/bin/omnisql`. To start the client, use the command `bin/omnisql -p HyperInteractive`, where HyperInteractive is the default password.
Once `omnisql` is running, use one of the following methods to see where your query is running:
- Prepend the `EXPLAIN` command to a `SELECT` statement to see a representation of the code that will run on the CPU or GPU. The first line is the important one: it shows either `IR for the GPU` or `IR for the CPU`. This is the most direct method.
- The server logs show a message at startup stating whether OmniSci has fallen back to CPU mode. The logs are in your $OMNISCI_STORAGE directory (default /var/lib/omnisci/data), in a directory named mapd_log.
- After you perform some queries, the `\memory_summary` command shows how much memory is in use on the CPU and on each GPU. OmniSci manages memory itself, so you see separate columns for in use (memory actually being used) and allocated (memory assigned to omnisci_server, but not necessarily in use yet). Data is loaded lazily from disk: you must run a query before data is moved to CPU and GPU memory, and even then OmniSci moves only the data and columns your queries touch.
How do I compare the performance on GPUs vs. CPUs to demonstrate the performance gain of GPUs?
To see the performance advantage of running on GPU over CPU, manually switch where your queries run:
- Enable timing reporting in `omnisql` using `\timing`.
- Ensure that you are in GPU mode (the default): `\gpu`.
- Run your queries a few times. Because data is lazily moved to the GPUs, the first query against new data or columns takes a bit longer than subsequent ones.
- Switch to CPU mode with `\cpu`, and again run your queries a few times.
If you are using a sufficiently large data set, you should see a significant difference between the two. However, if the sample set is relatively small (for example, the 7-million-row flights dataset that comes preloaded in OmniSci), some of the fixed overhead of running on the GPUs can make those queries appear to run slower than on the CPU.
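When driving queries from Python instead of `omnisql`, the same GPU-versus-CPU comparison can be wrapped in a small timing harness. This is a sketch: the workload below is a stand-in, and the `con.execute(...)` call mentioned in the comment assumes a hypothetical pymapd connection named `con`.

```python
import time

def time_query(run_query, repeats=3):
    """Time a callable several times; return the per-run durations in seconds.

    Running the query more than once matters: the first run includes the
    cost of lazily moving data to the GPU, so only later runs reflect
    steady-state performance.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload; with pymapd you might pass something like
# lambda: con.execute("SELECT count(*) FROM flights") for a connection `con`.
runs = time_query(lambda: sum(range(100_000)))
print(len(runs))  # 3
```

Compare the median of the later runs in GPU mode against CPU mode, rather than the first run, to exclude the one-time data-transfer cost.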
Does OmniSci support a single server with different GPUs? For example, can I install OmniSci on one server with two NVIDIA GTX 760 GPUs and two NVIDIA GTX TITAN GPUs?
OmniSci does not support mixing different GPU models. Initially, you might not notice many issues with that configuration because the GPUs are the same generation. However, in this case you should consider removing the GTX 760 GPUs, or configuring OmniSci not to use them.
To configure OmniSci to use specific GPUs:
- Run the `nvidia-smi` command to see the GPU IDs of the GTX 760s. Most likely, the GPUs are grouped together by type.
- Edit the `omnisci_server` config file as follows:
  - If the GTX 760 GPUs are 0,1, configure `omnisci_server` with the option `start-gpu=2` to use the remaining two TITAN GPUs.
  - If the GTX 760s are 2,3, add the option `num-gpus=2` to the config file.
The location of the config file depends on how you installed OmniSci.
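For example, if the GTX 760s enumerate as GPUs 0 and 1, the relevant line in the config file might look like the following sketch; match the IDs actually reported by nvidia-smi:

```
# Skip GPUs 0 and 1; use the devices starting at ID 2
start-gpu = 2
```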
What is watchdog, and when should it heel?
Watchdog tries to calculate whether a query can run in a reasonable amount
of time, given the available memory. If you have a query that you know is valid,
but the system is telling you there are insufficient resources, you can turn
watchdog off to give the system the extra time it needs.
To turn watchdog off, set the `--enable-watchdog` flag to `false` at startup.
See Using Configuration Flags on the Command Line.
What do you mean by binning numbers?
When you use a numeric column as a dimension for your chart, OmniSci automatically groups numbers into bins, gathering numbers by range.
If you turn off auto binning, your chart displays either all values (for example, the Table chart) or a set number of values in ascending or descending order (for example, the Pie chart).
Adjusting the number and range of bins for your chart can make your chart easier to understand at a glance by focusing on the data that is most important to you.
See Dimensions for more bundling and formatting options.
Why am I seeing the error "Distributed support is disabled"?
If you see the "Distributed support is disabled" error even though you have specified a cluster.conf file, check the aggregator log file for 'Cluster file specified running as aggregator with config <<path to config >>' and the leaf log file for 'String servers file specified running as dbleaf with config at << path to config >>'. If those lines are not present in the appropriate log files, there is likely an issue with the omnisci.conf server configuration for either the aggregator or the leaves.
Check for an inaccessible cluster.conf file. Ensure that the file has the correct strings (for example, string-servers and not string_servers).
Why am I seeing the error "Too many open files...errno 24"?
This can happen in Linux environments where the limit of open files per user is set too low. OmniSci recommends using `systemd`, which avoids this issue. If you need to set the limit manually, try setting your `nofile` limit to 50,000 as a starting point.
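If omnisci_server runs under a systemd unit you manage yourself, the limit can be raised with a drop-in override. A sketch, with the unit name illustrative (match your installed service):

```ini
# /etc/systemd/system/omnisci_server.service.d/limits.conf
[Service]
LimitNOFILE=50000
```

After adding the override, run `systemctl daemon-reload` and restart the service.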
What is causing a TypeError when I try to ingest a file using pandas.read_csv and pymapd.load_table?
If you use pandas.read_csv and pymapd.load_table to attempt to ingest a file that is greater than 2 GB in size, you might see the following exception:
TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array.
To avoid this problem when ingesting large files, set the `read_csv` parameter `chunksize` to a number of rows whose total data size is less than 2 GB. This chunked approach allows `pyarrow` to convert the data sets without error. For example:
df = pd.read_csv('myfile.csv', parse_dates=[0], date_parser=lambda epoch: pd.to_datetime(epoch, unit='s'), chunksize=1000000)
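A sketch of the chunked pattern end to end, using a small in-memory CSV in place of a large file on disk. The commented-out `con.load_table` call assumes a live pymapd connection named `con` and a table name of your choosing:

```python
import io
import pandas as pd

# Stand-in for a large file on disk; in practice, pass a file path to read_csv.
csv_data = io.StringIO("epoch,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames, so pyarrow
# never has to convert more than one chunk at a time.
for chunk in pd.read_csv(csv_data, chunksize=4):
    total_rows += len(chunk)
    # con.load_table("mytable", chunk)  # requires a pymapd connection

print(total_rows)  # 10
```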
How can I avoid creating duplicate rows?
To detect duplication before loading data into OmniSciDB, you can perform the following steps. For this example, the files are labeled A, B, C...Z.
1. Load file A into table `MYTABLE`.
2. Run the following query:
   `select uniqueCol, count(*) as dups from MYTABLE group by uniqueCol having count(*) > 1;`
   There should be no rows returned; if rows are returned, file A itself contains duplicate values.
3. Load file B into table `TEMPTABLE`.
4. Run the following query:
   `select t1.uniqueCol as dups from MYTABLE t1 join TEMPTABLE t2 on t1.uniqueCol = t2.uniqueCol;`
   There should be no rows returned if file B is unique. If rows are returned, use them to fix file B.
5. Load the fixed B file into `MYTABLE`.
6. Drop table `TEMPTABLE`.
7. Repeat steps 3-6 for each remaining file before loading the data into the real `MYTABLE` instance.
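The same pre-load check can also be run outside the database. A minimal stdlib-only sketch that flags key values appearing more than once across a set of CSV files; the file contents and the key column position are illustrative:

```python
import csv
import io

def find_duplicates(file_handles, key_index=0):
    """Return key values that appear more than once across the given CSV files."""
    seen = set()
    dups = set()
    for handle in file_handles:
        reader = csv.reader(handle)
        next(reader)  # skip the header row
        for row in reader:
            key = row[key_index]
            if key in seen:
                dups.add(key)
            seen.add(key)
    return dups

# Files A and B share the key "2", so it is flagged before any load.
file_a = io.StringIO("uniqueCol,val\n1,x\n2,y\n")
file_b = io.StringIO("uniqueCol,val\n2,z\n3,w\n")
print(sorted(find_duplicates([file_a, file_b])))  # ['2']
```

For files too large to hold keys in memory, the database-side join approach above remains the safer option.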
Integrations and Connectors
How do I connect to OmniSci using Tableau?
You can connect to OmniSci from Tableau using an ODBC connection. The ODBC connector is included with OmniSci Enterprise Edition, along with a Tableau .tdc file to tailor the ODBC connection between Tableau and OmniSci. Contact support@omnisci.com for assistance and specific instructions for how to connect to OmniSci from Tableau.
Licensing
What am I allowed to do using OmniSci Community Edition?
OmniSci Community Edition is free for non-commercial use.
What are the capabilities of OmniSci Open Source and Community edition?
Open Source and Community Edition support as many GPUs as fit in a single node. They do not support distributed configurations or server-based rendering.
Tips and Tricks
When should I use dictionary encoding?
The short answer is "whenever possible." OmniSci uses a Dictionary Coder to optimize storage and performance. When you import data with Immerse, OmniSci scans text-based fields for duplicate values. If 70% of the values appear more than once, the field is defined as TEXT ENCODING DICT. A dictionary table stores the common values, while the database stores references to the full values. This can result in huge savings in storage and processing time.
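When you create tables by hand rather than importing through Immerse, you can declare the encoding explicitly. A sketch, with the table and column names illustrative:

```sql
-- A low-cardinality column such as city benefits from dictionary encoding;
-- a high-cardinality column such as a session ID can opt out.
CREATE TABLE visits (
  city TEXT ENCODING DICT(32),
  session_id TEXT ENCODING NONE
);
```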
How can I optimize compression?
Use the omnisql `\o` command to output an optimized CREATE TABLE statement, based on the size of the actual data stored in your table. See omnisql.
How can I ensure that certain columns are ready at startup?
OmniSci uses a smart cache mechanism that caches parts of columns (chunks) in main memory or GPU memory.
You can use the command-line argument `db-query-list` to provide a path to a file that contains SELECT queries you want performed at startup.
Begin the file with the line `USER [super-user-name] [database-name]`.
For example:
USER admin omnisci
Specify columns that you want to cache in a `WHERE` condition that does not have any other filters. For example, in an employee database, a query to cache salary might be the following.
select count(*) from employee where salary > 1;
This query attempts to fit the entire salary column into the hottest parts of the cache. Keep the query as simple as possible so that the entire column (or columns) can be cached.
Note: A query such as `select * from employee;` does not cache anything of interest, because there is nothing to "process."
See Preloading Data.
When should I use shared dictionaries?
You can improve performance of string operations and optimize storage using shared dictionaries. You can share dictionaries within a table or between different tables in the same database.
See Shared Dictionaries.
When should I use max_rows?
The `max_rows` setting defines the maximum number of rows allowed in a table. When you reach the limit, the oldest fragment is removed. This can be helpful when executing operations that insert and retrieve records based on insertion order. The default value for `max_rows` is 2^62.
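`max_rows` is set per table at creation time. A sketch, with the table definition and row cap illustrative:

```sql
-- Keep roughly the most recent 10 million rows; once the cap is reached,
-- the oldest fragment is dropped to make room for new inserts.
CREATE TABLE events (
  ts TIMESTAMP,
  msg TEXT ENCODING DICT
)
WITH (max_rows = 10000000);
```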
What are some best practices for inserting data?
When inserting data, it is best to load data in batches rather than loading one row at a time (as you might with a streaming data source). The overhead for loading data is comparatively high for each transaction, regardless of the number of rows you insert.
See Database Design.
Why do my points disappear when I zoom in?
When you zoom in on a Pointmap or Scatter Plot chart with the POINT AUTOSIZE feature enabled, the points are reduced in size as you zoom in. This can make your points seem to vanish as they move farther apart. There are a couple of ways to address this:
- Turn off POINT AUTOSIZE to display the POINT SIZE slider. You can adjust the size of the points to make them easier to see.
- On a Pointmap, change the MAP THEME to Dark to make the points stand out.
See Pointmap.