Single Finger Test – CAN errors

Hi everyone!

I’m currently building the TriFinger Robot and testing a single finger with the single_finger_test.py script. I’ve read through Trouble Running Single Finger Test - TriFinger Robot, which has helped greatly, but I’m still running into some issues.

I was able to run single_finger_test.py once and it seemed to run smoothly, but after a couple of seconds I received an error much like the one shown here: Trouble Running Single Finger Test - TriFinger Robot - #15 by Ross_Worobel. However, when I ran the script again after this initial test (without changing anything in the setup or software), instead of the expected display/table of joint positions, I got the error messages shown in the photo below.

I also tried initializing the motors manually with start_motors_via_can.sh and monitored the outputs of candump -e can0 and netstat -i.

Some things I’ve noticed are:

  • The CAN messages with ID ‘010’ initially have the value 0x1F, but after I actually run single_finger_test.py, they change to 0x54.
  • The blue LEDs on the TI Launchpad initially blink, but after some time, they stop blinking, and instead, remain on.
  • It seems that messages are being dropped, even though they weren’t during the initial run-through of the script.

I would appreciate any and all help! Thank you!

Best,
Joah

Here is the output of candump -e can0 for some further insight!

Hi Joah,

Sorry for the late response, I’m just back from a longer leave.
Did you make any progress in the meantime?

Are you using a real-time kernel? If not, you will get timing-related issues very frequently. To begin with, I suggest you try the “lowlatency” kernel, which is much easier to install than the PREEMPT_RT kernel. See Real Time Setup — robot_interfaces documentation.
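
By the way, a quick way to check which kernel you are currently running (a minimal sketch; it simply looks for “PREEMPT” in the kernel version string, which both kernels mentioned above typically include):

import os

# A preemption-capable kernel (lowlatency or PREEMPT_RT) typically
# advertises "PREEMPT" in its version string.
version = os.uname().version
if "PREEMPT" in version:
    print("Preemptible kernel detected:", version)
else:
    print("Generic kernel detected, expect timing issues:", version)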

Regarding the “Something went wrong with sending CAN frame” errors

  • Does this happen frequently? We also get this error once in a while. Typically power cycling the motor boards is enough to fix it. If it happens all the time, then it is of course a problem.
  • What CAN hardware are you using to connect the board with the computer?

Status message codes

I added a script decode_can_status_message.py to the firmware repository, which can be used to turn the hex code of the 010 messages into something human-readable. With this, I get:

Status Message: 0x1f

System enabled: 1
Motor 1 enabled: 1
Motor 1 ready: 1
Motor 2 enabled: 1
Motor 2 ready: 1
Error: [0] No error

and

Status Message: 0x54

System enabled: 0
Motor 1 enabled: 0
Motor 1 ready: 1
Motor 2 enabled: 0
Motor 2 ready: 1
Error: [2] CAN receive timeout

So at first all is good; then a CAN receive timeout error is triggered (i.e. no new command arrived from the computer for too long), which stops the motors.
I expect this issue can be fixed by using a more real-time-capable kernel (see above), assuming you are not already doing this.
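
For reference, here is a minimal sketch of what the decoding boils down to, assuming the bit layout implied by the two outputs above (the decode_can_status_message.py script in the firmware repository is the authoritative version):

ERROR_CODES = {0: "No error", 2: "CAN receive timeout"}  # subset only

def decode_status(byte):
    print(f"Status Message: {byte:#04x}")
    print("System enabled:", (byte >> 0) & 1)
    print("Motor 1 enabled:", (byte >> 1) & 1)
    print("Motor 1 ready:", (byte >> 2) & 1)
    print("Motor 2 enabled:", (byte >> 3) & 1)
    print("Motor 2 ready:", (byte >> 4) & 1)
    error = byte >> 5
    print(f"Error: [{error}] {ERROR_CODES.get(error, 'unknown')}")

decode_status(0x1F)  # all enabled and ready, no error
decode_status(0x54)  # motors disabled, error 2: CAN receive timeout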

LED

The LED blink codes are described here: LED Blink Codes — ODRI Firmware for uDriver Boards 1.0.0 documentation
If the blue LED switches from blinking to always on, this means that the motors got disabled. This seems logical in combination with the error, but I would expect that in this case the red LED would also turn on.

Hi Felix,

Thanks for your reply! I use the dual-channel PEAK CAN card that is outlined on GitHub. I am currently using the lowlatency kernel and getting the same error. Does changing to the PREEMPT_RT kernel make a significant difference? I assumed that since I was able to run the code the very first time and see the GUI, the lowlatency kernel would be sufficient for testing purposes.

In terms of the LEDs, the board without errors has a constant green and red light, while the other one with the CAN timeout error has a constant green and blue light (no red) after the initial homing process.

Another note: after some debugging, it seems that even after the homing process finishes, the code gets stuck in the initialize() function and thus never starts the GUI. Would you happen to know why this is the case?

Thank you again,
Joah

Another thing I noticed when using the CCS GUI debugger and running in spring mode: the encoder error count for one of the motors is 2, but if I initialize it at a far enough angle, the encoder error count is 0. Additionally, if I run it at a constant torque instead, the encoder error count is always 0. How should I interpret this, and what exactly is the encoder error count? Thank you!

For this test, lowlatency vs. PREEMPT_RT shouldn’t make a significant difference. We ourselves started testing with the lowlatency kernel only recently, but from what I have found so far, it only matters when putting high computational load on the CPU (in which case the PREEMPT_RT kernel will be more robust). For this simple test application I expect the lowlatency kernel to be good enough.

To make sure I understand correctly: Did you already use the lowlatency kernel previously, or did you just switch now? In any case, did you do a clean rebuild of the whole workspace (i.e. delete the build/ and install/ directories, then rebuild) while the lowlatency kernel was running? This is crucial because real-time threads are only used if a corresponding kernel is detected at build time. I.e. if you build on a normal kernel, then reboot into the real-time kernel and run the code without rebuilding, it will not use real-time threads.

When it gets stuck in initialize(), what is the last output you get?

Regarding the LEDs: This might be a regression in the latest firmware, causing the LEDs to be flipped, i.e. they are on when they should be off and off when they should be on. I thought I had fixed this but apparently I didn’t… I’ll check next time I’m in the lab.

Regarding the encoder error count: This is normal. The error count is the difference (in encoder ticks) between the expected position of the special index tick and the position where it is actually observed. If I remember correctly, there is for some reason a deviation of this order of magnitude (~2 ticks) when switching direction. So you are likely to see this in spring mode, where the motor moves back and forth, but not when it constantly spins in one direction.
An error is only triggered if the error count exceeds what corresponds to one degree of motor motion (=13 ticks with the 5000-lines-encoder), so this small deviation is not a problem.
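
Spelling that threshold out (my own quick arithmetic based on the numbers above):

ticks_per_revolution = 5000            # 5000-line encoder
ticks_per_degree = ticks_per_revolution / 360
print(int(ticks_per_degree))           # 13 -> the error threshold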

Hi Felix,

Thank you for your reply. I was using the lowlatency kernel previously (and was able to get demo_fake_finger working using a single-process time series). I also deleted the build/ and install/ directories and rebuilt, just in case.

In order to see where it gets stuck in initialize(), I added print statements to the code and noticed something interesting. After the messages

waiting for board and motors to be ready
board and motors are ready

The code then goes into the initialize() function (n_joint_blmc_robot_driver.hxx, ~line 338) and creates the real-time thread. It then enters the join() function (thread.cpp, ~line 180). However, I believe it never exits this function, so it keeps running join() and thus never finishes initialize().

After realtime_thread.join() is called, homing begins and finishes successfully (this happens in _initialize(), I believe). After homing, I am able to get inside the if (homing_succeeded){ } statement, which indicates that homing has succeeded. Then the line Vector waypoint = get_latest_observation().position is also executed. However, after that, it seems like nothing happens. Perhaps the code is getting stuck in some background process that I am not aware of? I’m not too sure. I’m attaching the _initialize() code along with the print statements that I added, to show exactly where it stops working. The output in the terminal when I run ros2 run robot_fingers single_finger_test cuts off at d (in other words, it executes line 670, but not 672).
Please let me know if you know why this would happen. Thank you!

Best,
Joah

One thing I noticed: you didn’t add \n to the end of the prints in lines 672 and 675. I’m not sure if this is the cause, but it might be that because of this the prints are kept in a buffer and not directly flushed to the output. Can you add newlines to those prints and try again?
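
For illustration, the same effect in a small Python sketch (the buffering behaviour is analogous for the C++ prints in question):

import time

# stdout is line-buffered when attached to a terminal: without a
# trailing newline, text can sit in the buffer even though the
# print already executed.
print("homing done", end="")
time.sleep(2)        # nothing visible during this sleep
print()              # the newline flushes the buffered text

# Alternatively, force an immediate flush:
print("homing done", flush=True)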

However, this of course doesn’t explain why it is hanging somewhere. Unfortunately I don’t have an idea right now…
Can you post the single_finger_test.yml file you are using (in robot_fingers/config/)? Maybe the number of “move_steps” was accidentally set to a very high value or something like that.

I updated the MotorWare-patch in udriver_firmware with a fix for the LEDs. There was indeed a bug that the status of the LEDs was inverted (i.e. they were off when they should be on and vice versa). With the updated patch, this should be fixed now.

Hi Felix,

After adding \n to the print statements, they are indeed printed. However, the for loop only runs for one iteration (i = 0) and never continues to the next.

Here is the single_finger_test.yml file that I am using:
[screenshot of single_finger_test.yml]

It seems to match the version currently on GitHub, with the exception of homing_method, but I assumed that would not matter. move_steps is set to 500; is this considered too high?

Another thing I’ve noticed when inspecting candump -e is that can0 first receives the CAN receive timeout error message, and shortly after, can1 also receives this error. This is the case every time. Do you know why this happens?

Thanks so much!
Joah

Additionally, I added another print statement after line 676

waypoint[i] = config_.initial_position_rad[i];

and confirmed that it executes this line, so it seems like the code is getting hung up on line 678

reached_goal = move_to_position(waypoint, config_.move_to_position_tolerance_rad, config_.calibration.move_steps);

Also, one quick question: why is only waypoint[i] modified? Why can’t we just use

Vector waypoint = get_latest_observation().position

directly?

No, 500 is totally fine.

Do you mean why can0 is always first, immediately followed by can1? I assume this is because the communication with the boards is done in a loop, sending commands first to can0, then to can1. If the communication is delayed, the last received command will therefore be a little bit older on can0, thus also triggering the timeout a bit earlier.
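
As a toy illustration of this effect (a sketch only, not the actual driver code; the 10 ms timeout value is explained further down):

import time

BOARD_TIMEOUT = 0.010  # receive timeout on the boards (10 ms, see below)
last_command = {}

# One communication cycle: can0 is served first, then can1.
last_command["can0"] = time.monotonic()
last_command["can1"] = time.monotonic()

time.sleep(0.012)  # simulate a delayed control loop

now = time.monotonic()
for bus, sent in last_command.items():
    status = "TIMEOUT" if now - sent > BOARD_TIMEOUT else "ok"
    print(f"{bus}: last command {(now - sent) * 1000:.3f} ms ago ({status})")

# can0's last command is always slightly older than can1's, so can0
# crosses the timeout threshold first.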

I know it is a bit tedious, but can you also add prints in the move_to_position method to further narrow it down?

Regarding the re-assignment of waypoint: The purpose of the code is to move the robot to the configured initial position (given in config_.initial_position_rad), but moving only one joint after the other (this helps avoid collisions on some of our robots). The move_to_position method only accepts a target position for the whole robot. So to achieve the desired behaviour, the goal position (aka waypoint) is first initialised to the current position of the robot; then, one joint after another, the goal position is overwritten with the actually desired one, calling move_to_position in each step.
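
In pseudo-Python, the logic looks roughly like this (a sketch of the behaviour described above, not the actual C++ implementation):

import numpy as np

def move_joints_one_by_one(current_position, initial_position_rad,
                           move_to_position, tolerance, move_steps):
    # Start from where the robot currently is...
    waypoint = np.array(current_position, dtype=float)
    for i in range(len(waypoint)):
        # ...then overwrite one joint at a time with its real target,
        # so each move_to_position call only moves that one joint.
        waypoint[i] = initial_position_rad[i]
        if not move_to_position(waypoint, tolerance, move_steps):
            return False  # joint i did not reach its goal
    return True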

After adding print statements to move_to_position it seems like the code gets stuck in the for loop that iterates over time_steps (lines 882-897). The for loop runs, and while it is running, the CAN receive timeout error message is received.

Do you have an idea of why this may be? I can also try to add print statements to apply_action_uninitialized(), but I don’t know if this will necessarily be helpful because it seems like it is being executed at every iteration.

Thank you for the clarification on my other questions, your answers helped!

Additionally, in apply_action_uninitialized(), I’ve tried changing the increment of 0.001 in line 647 to 0.1 just in case the lowlatency kernel had issues with such a fast time increment. However, this too did not help with the timeout issue.

When/how does it get stuck in the loop? Does it run at first, until the timeout error occurs, and then get stuck?
If this is the case, can you add some timing code to the loop to measure the time spent between consecutive calls of apply_action_uninitialized (to see if there is some unexpected delay)?
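
Something along these lines (sketched in Python for brevity; in the actual C++ loop, std::chrono::steady_clock would be the equivalent):

import time

def apply_action_uninitialized():
    time.sleep(0.001)  # stand-in for the real call being timed

previous = time.perf_counter()
for step in range(500):  # move_steps
    now = time.perf_counter()
    gap_ms = (now - previous) * 1000.0
    if gap_ms > 10.0:  # longer than the board's receive timeout
        print(f"step {step}: {gap_ms:.2f} ms since the previous command!")
    previous = now
    apply_action_uninitialized()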

The timeout error is triggered by the board if no new command is received for 10 ms.
Increasing the sleep inside apply_action_uninitialized to 0.1 s will not help; it will actually trigger the timeout immediately, as it increases the time between two commands to more than the timeout.
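
In numbers:

board_timeout = 0.010    # board stops the motors after 10 ms of silence
original_sleep = 0.001   # commands at ~1 kHz, well within the timeout
increased_sleep = 0.1    # ten times the timeout
print(increased_sleep > board_timeout)  # True -> timeout on every cycle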

Hi Felix,

Sorry for the delayed response. There were some issues with the PC on my end, so I ended up having to reboot into a new kernel, and this seems to have fixed the issues! I’m not entirely sure why, but single_finger_test.py now works!

Thank you for all of your help!

Great to hear that, you’re welcome!
